{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction: Automated GBM Hyperparameter Optimization\n",
    "\n",
    "In this notebook we will walk through automated hyperparameter tuning using Bayesian Optimization. Specifically, we will optimize the hyperparameters of a Gradient Boosting Machine using the Hyperopt library (with the Tree Parzen Estimator algorithm).  We will compare the results of random search (implemented manually) for hyperparameter tuning with the Bayesian model-based optimization method to try and understand how the Bayesian method works and what benefits it has over uninformed search methods. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hyperopt\n",
    "\n",
    "Hyperopt is one of several automated hyperparameter tuning libraries using Bayesian optimization. These libraries differ in the algorithm used to both construct the surrogate (probability model) of the objective function and choose the next hyperparameters to evaluate in the objective function. Hyperopt uses the Tree Parzen Estimator (TPE). Other Python libraries include Spearmint, which uses a Gaussian process for the surrogate, and SMAC, which uses a random forest regression. \n",
    "\n",
    "Hyperopt has a simple syntax for structuring an optimization problem which extends beyond hyperparameter tuning to any problem that involves minimizing a function. Moreover, the structure of a Bayesian Optimization problem is similar across the libraries, with the major differences coming in the syntax (and in the algorithms behind the scenes that we do not have to deal with). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Modeling\n",
    "import lightgbm as lgb\n",
    "\n",
    "# Evaluation of the model\n",
    "from sklearn.model_selection import KFold\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "MAX_EVALS = 2\n",
    "N_FOLDS = 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "For this notebook we will work with only the `application` data. The methods developed here can work on any dataset but the run-times can be long! We will also separate the training set into training and testing to evaluate the performance of hyperparameter tuning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train shape:  (246008, 120)\n",
      "Test shape:  (61503, 120)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NAME_CONTRACT_TYPE</th>\n",
       "      <th>CODE_GENDER</th>\n",
       "      <th>FLAG_OWN_CAR</th>\n",
       "      <th>FLAG_OWN_REALTY</th>\n",
       "      <th>CNT_CHILDREN</th>\n",
       "      <th>AMT_INCOME_TOTAL</th>\n",
       "      <th>AMT_CREDIT</th>\n",
       "      <th>AMT_ANNUITY</th>\n",
       "      <th>AMT_GOODS_PRICE</th>\n",
       "      <th>NAME_TYPE_SUITE</th>\n",
       "      <th>...</th>\n",
       "      <th>FLAG_DOCUMENT_18</th>\n",
       "      <th>FLAG_DOCUMENT_19</th>\n",
       "      <th>FLAG_DOCUMENT_20</th>\n",
       "      <th>FLAG_DOCUMENT_21</th>\n",
       "      <th>AMT_REQ_CREDIT_BUREAU_HOUR</th>\n",
       "      <th>AMT_REQ_CREDIT_BUREAU_DAY</th>\n",
       "      <th>AMT_REQ_CREDIT_BUREAU_WEEK</th>\n",
       "      <th>AMT_REQ_CREDIT_BUREAU_MON</th>\n",
       "      <th>AMT_REQ_CREDIT_BUREAU_QRT</th>\n",
       "      <th>AMT_REQ_CREDIT_BUREAU_YEAR</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>44645</th>\n",
       "      <td>Cash loans</td>\n",
       "      <td>F</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>0</td>\n",
       "      <td>135000.0</td>\n",
       "      <td>490495.5</td>\n",
       "      <td>36027.0</td>\n",
       "      <td>454500.0</td>\n",
       "      <td>Unaccompanied</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149709</th>\n",
       "      <td>Cash loans</td>\n",
       "      <td>F</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>1</td>\n",
       "      <td>99000.0</td>\n",
       "      <td>450000.0</td>\n",
       "      <td>46242.0</td>\n",
       "      <td>450000.0</td>\n",
       "      <td>Spouse, partner</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9290</th>\n",
       "      <td>Cash loans</td>\n",
       "      <td>F</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>0</td>\n",
       "      <td>148500.0</td>\n",
       "      <td>571446.0</td>\n",
       "      <td>18562.5</td>\n",
       "      <td>477000.0</td>\n",
       "      <td>Unaccompanied</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>279223</th>\n",
       "      <td>Cash loans</td>\n",
       "      <td>M</td>\n",
       "      <td>Y</td>\n",
       "      <td>N</td>\n",
       "      <td>1</td>\n",
       "      <td>157500.0</td>\n",
       "      <td>331920.0</td>\n",
       "      <td>16096.5</td>\n",
       "      <td>225000.0</td>\n",
       "      <td>Unaccompanied</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35335</th>\n",
       "      <td>Cash loans</td>\n",
       "      <td>M</td>\n",
       "      <td>Y</td>\n",
       "      <td>Y</td>\n",
       "      <td>1</td>\n",
       "      <td>225000.0</td>\n",
       "      <td>1125000.0</td>\n",
       "      <td>33025.5</td>\n",
       "      <td>1125000.0</td>\n",
       "      <td>Unaccompanied</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 120 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  \\\n",
       "44645          Cash loans           F            N               Y   \n",
       "149709         Cash loans           F            N               Y   \n",
       "9290           Cash loans           F            N               Y   \n",
       "279223         Cash loans           M            Y               N   \n",
       "35335          Cash loans           M            Y               Y   \n",
       "\n",
       "        CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  \\\n",
       "44645              0          135000.0    490495.5      36027.0   \n",
       "149709             1           99000.0    450000.0      46242.0   \n",
       "9290               0          148500.0    571446.0      18562.5   \n",
       "279223             1          157500.0    331920.0      16096.5   \n",
       "35335              1          225000.0   1125000.0      33025.5   \n",
       "\n",
       "        AMT_GOODS_PRICE  NAME_TYPE_SUITE             ...              \\\n",
       "44645          454500.0    Unaccompanied             ...               \n",
       "149709         450000.0  Spouse, partner             ...               \n",
       "9290           477000.0    Unaccompanied             ...               \n",
       "279223         225000.0    Unaccompanied             ...               \n",
       "35335         1125000.0    Unaccompanied             ...               \n",
       "\n",
       "       FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21  \\\n",
       "44645                 0                0                0                0   \n",
       "149709                0                0                0                0   \n",
       "9290                  0                0                0                0   \n",
       "279223                0                0                0                0   \n",
       "35335                 0                0                0                0   \n",
       "\n",
       "        AMT_REQ_CREDIT_BUREAU_HOUR  AMT_REQ_CREDIT_BUREAU_DAY  \\\n",
       "44645                          0.0                        0.0   \n",
       "149709                         NaN                        NaN   \n",
       "9290                           0.0                        0.0   \n",
       "279223                         0.0                        0.0   \n",
       "35335                          0.0                        0.0   \n",
       "\n",
       "        AMT_REQ_CREDIT_BUREAU_WEEK  AMT_REQ_CREDIT_BUREAU_MON  \\\n",
       "44645                          0.0                        0.0   \n",
       "149709                         NaN                        NaN   \n",
       "9290                           0.0                        1.0   \n",
       "279223                         0.0                        0.0   \n",
       "35335                          0.0                        0.0   \n",
       "\n",
       "        AMT_REQ_CREDIT_BUREAU_QRT  AMT_REQ_CREDIT_BUREAU_YEAR  \n",
       "44645                         0.0                         1.0  \n",
       "149709                        NaN                         NaN  \n",
       "9290                          0.0                         2.0  \n",
       "279223                        1.0                         0.0  \n",
       "35335                         0.0                         2.0  \n",
       "\n",
       "[5 rows x 120 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Read in data and separate into training and testing sets\n",
    "features = pd.read_csv('../input/application_train.csv')\n",
    "\n",
    "# Extract the labels and format properly\n",
    "labels = np.array(features['TARGET'].astype(np.int32)).reshape((-1,))\n",
    "\n",
    "# Drop the unneeded columns\n",
    "features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])\n",
    "\n",
    "train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.2)\n",
    "\n",
    "print('Train shape: ', train_features.shape)\n",
    "print('Test shape: ', test_features.shape)\n",
    "\n",
    "train_features.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gradient Boosting Machine Default Model\n",
    "\n",
    "We will use the LightGBM implementation of the gradient boosting machine. This is much faster than the Scikit-Learn implementation and achieves results comparable to extreme gradient boosting, XGBoost. For the baseline model, we will use the default hyperparameters as specified in LightGBM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,\n",
       "        learning_rate=0.1, max_depth=-1, min_child_samples=20,\n",
       "        min_child_weight=0.001, min_split_gain=0.0, n_estimators=100,\n",
       "        n_jobs=-1, num_leaves=31, objective=None, random_state=50,\n",
       "        reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,\n",
       "        subsample_for_bin=200000, subsample_freq=1)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = lgb.LGBMClassifier(random_state=50)\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All we need to do is fit the model on the training data and make predictions on the testing data. For the predictions, because we are measuring ROC AUC and not accuracy, we have the model predict probabilities and not hard binary values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Encoding Categorical Variables\n",
    "\n",
    "For the default model, we will use one-hot encoding of categorical variables. The gradient boosting machine can handle categorical variables that are integer encoded, which we can look at incorporating as one of our hyperparameters over which to search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of features:  242\n"
     ]
    }
   ],
   "source": [
    "one_hot_train = pd.get_dummies(train_features)\n",
    "one_hot_test = pd.get_dummies(test_features)\n",
    "\n",
    "one_hot_train, one_hot_test = one_hot_train.align(one_hot_test, axis = 1, join = 'inner')\n",
    "one_hot_features = list(one_hot_train.columns)\n",
    "\n",
    "print('Number of features: ', one_hot_train.shape[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   21.9s finished\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import roc_auc_score\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from timeit import default_timer as timer\n",
    "\n",
    "start = timer()\n",
    "cv = cross_val_score(estimator = model, X = one_hot_train, \n",
    "                     y = train_labels, \n",
    "                     verbose = 2, n_jobs = -1, \n",
    "                     cv = N_FOLDS, scoring = 'roc_auc')\n",
    "cv_time = timer() - start"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Baseline model 2 fold cv AUC ROC score: 0.75247\n",
      "Baseline model eval time: 22.52 seconds.\n"
     ]
    }
   ],
   "source": [
    "print('Baseline model {} fold cv AUC ROC score: {:.5f}'.format(N_FOLDS, np.mean(cv)))\n",
    "print('Baseline model eval time: {:.2f} seconds.'.format(cv_time))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's out metric to beat! Of course, this performance will not necessarily translate to the comptition data, but we can use it for comparison purposes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Label Encoding\n",
    "\n",
    "The other option we have for categorical features is label encoding. This maps each different value of a categorical feature to an integer. After we convert the values, we tell the model the categorical features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "le = LabelEncoder()\n",
    "\n",
    "le_train = train_features.copy()\n",
    "le_test = test_features.copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of features:  120\n"
     ]
    }
   ],
   "source": [
    "cat_features = []\n",
    "\n",
    "for i, col in enumerate(le_train):\n",
    "    if le_train[col].dtype == 'object':\n",
    "        le_train[col] = le.fit_transform(np.array(le_train[col].astype(str)).reshape((-1, )))\n",
    "        le_test[col] = le.transform(np.array(le_test[col].astype(str)).reshape((-1, )))\n",
    "        cat_features.append(i)\n",
    "        \n",
    "print('Number of features: ', le_train.shape[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   20.4s finished\n"
     ]
    }
   ],
   "source": [
    "start = timer()\n",
    "cv = cross_val_score(estimator = model, X = le_train, y = train_labels, \n",
    "                     fit_params = {'categorical_feature': cat_features},\n",
    "                     verbose = 2, n_jobs = -1, cv = N_FOLDS, \n",
    "                     scoring = 'roc_auc')\n",
    "cv_time = timer() - start"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Baseline model with label encoding 2 fold cv AUC ROC score: 0.75171\n",
      "Baseline model with label encoding eval time: 20.90 seconds.\n"
     ]
    }
   ],
   "source": [
    "print('Baseline model with label encoding {} fold cv AUC ROC score: {:.5f}'.format(N_FOLDS, np.mean(cv)))\n",
    "print('Baseline model with label encoding eval time: {:.2f} seconds.'.format(cv_time))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Random Search \n",
    "\n",
    "First we will implement a common technique for hyperparameter optimization: random search. Each iteration, we choose a random set of model hyperparameters from a search space. Empirically, random search is very effective, returning nearly as good results as grid search with a significant reduction in time spent searching. However, it is still an uninformed method in the sense that it does not use past evaluations of the objective function to inform the choices it makes for the next evaluation. \n",
    "\n",
    "Random search uses the following four parts, which also are used in Bayesian hyperparameter optimization:\n",
    "\n",
    "1. Domain: values over which to search\n",
    "2. Optimization algorithm: pick the next values at random! (yes this qualifies as an algorithm)\n",
    "3. Objective function to minimize: in this case our metric is cross validation ROC AUC\n",
    "4. Results history that tracks the hyperparameters tried and the cross validation metric\n",
    "\n",
    "Random search can be implemented in the Scikit-Learn library using `RandomizedSearchCV`, however, because we are using Early Stopping (to determine the optimal number of estimators), we will have to implement the method ourselves (more practice!). This is pretty straightforward, and many of the ideas in random search will transfer over to Bayesian hyperparameter optimization. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Domain for Random Search\n",
    "\n",
    "Random search and Bayesian optimization both search for hyperparameters from a domain. For random (or grid search) this domain is called a hyperparameter grid and uses discrete values for the hyperparameters.\n",
    "\n",
    "First, let's look at all of the hyperparamters that need to be tuned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,\n",
       "        learning_rate=0.1, max_depth=-1, min_child_samples=20,\n",
       "        min_child_weight=0.001, min_split_gain=0.0, n_estimators=100,\n",
       "        n_jobs=-1, num_leaves=31, objective=None, random_state=None,\n",
       "        reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,\n",
       "        subsample_for_bin=200000, subsample_freq=1)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lgb.LGBMClassifier()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the default values, we can construct the following hyperparameter grid. It's difficult to say ahead of time what choices will work best, so we will use a wide range of values centered around the default for most of the hyperparameters. \n",
    "\n",
    "The `subsample_dist` will be used for the `subsample` parameter but we can't put it in the param grid because `boosting_type=goss` does not support row subsampling. Therefore we will use an `if` statement when choosing our hyperparameters to choose a subsample ratio if the boosting type is not `goss`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hyperparameter grid\n",
    "param_grid = {\n",
    "    'class_weight': [None, 'balanced'],\n",
    "    'boosting_type': ['gbdt', 'goss', 'dart'],\n",
    "    'num_leaves': list(range(30, 150)),\n",
    "    'learning_rate': list(np.logspace(np.log(0.005), np.log(0.2), base = np.exp(1), num = 1000)),\n",
    "    'subsample_for_bin': list(range(20000, 300000, 20000)),\n",
    "    'min_child_samples': list(range(20, 500, 5)),\n",
    "    'reg_alpha': list(np.linspace(0, 1)),\n",
    "    'reg_lambda': list(np.linspace(0, 1)),\n",
    "    'colsample_bytree': list(np.linspace(0.6, 1, 10)),\n",
    "    'encoding': ['one_hot', 'label']\n",
    "}\n",
    "\n",
    "# Subsampling (only applicable with 'goss')\n",
    "subsample_dist = list(np.linspace(0.5, 1, 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at two of the distributions, the `learning_rate` and the `num_leaves`. The learning rate is typically [represented by a logarithmic distribution](https://www.quora.com/Why-does-one-sample-the-log-space-when-searching-for-good-Hyper-Parameters-for-Machine-Learning) because it can vary over several orders of magnitude. `np.logspace` returns values evenly spaced over a log-scale (so if we take the log of the resulting values, the distribution will be uniform.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEeCAYAAACZlyICAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3XmYXFW57/HvjymACIQkDIahw6ACgoARUcQLBFRQgSOgcESGwxEFfETRoyieI3gExOGCqBduFGUQBUQRRLmKhDgdGcIUGZSEECAQkjBDgDC994+1KtmpVHev6lRXVXf/Ps9TT++99tprv7Vrd721154UEZiZmfVnhU4HYGZmQ4MThpmZFXHCMDOzIk4YZmZWxAnDzMyKOGGYmVkRJwwbVJJ6JIWkkzody0jTiXXfaJmd2ga87bWeE0YXkrRr3tA/1+lYhgtJJ+V1Wnu9KulxSddK2qdF7e/Xilh7ab8+9mckzZJ0uaQjJK3W4uUdLunTrWxzMOSkcJKk7Tody0iwUqcDsGHvfmA14OVOB5L9F3AfadvfDPg4cIWkQyLiouVo9yvA+cCvlj/EXt0GfDsPrw5sDLwb+BFwoqT9I+L2Sv3lWfeHAz3AmU3O1+7Pu4e07meT1k8nYxn2nDCsmKTXRsQzzcwT6VYCLwxSSANxdURMq41Iuoz0RXMCsDwJox0eioif1JV9WdKBpNivlrR1RDwB7V33tW2jmz7vbopluHCX1BAnaZSkL0m6U9ILkp6U9GtJ29fVW0HSiZL+JOkRSS9KekDS2ZLG1NVd3Pcr6cOSbpb0PPDdPP28PH2tPP/8vOy/Snpbb2310v77Jd2U558r6ZuSlvkhI2l/Sbfneg9I+oqkPXI7hw90/eVf5I8CWzRY5jGSfi/poby+5kr6iaSe+veSRw+rdh3VtbVHbuvJ/B6mS/rEQOOuew8/B74BbAAcWx9bfR++pEMl3ZhjWZi7ti6SNC5Pnw38L2CTuq6wXfP0qZJmS9pU0mWSHgee7muZlWUfnN977XM8qf7zrrXfYN6l2s6f+3V58o8rcU7t5/2vJOkLku7KcTym1LW3TW/LK91Oh7sR94aHE0krA/8PeAdwIfA9YC3gY8BfJb2r8mt6FeA/gF8AVwALgbcCRwLvlPSWiHixbhH7AZ8CzgbOIX8pVPwOWAB8FRgDHA/8VlJP4Z7I3sAxue0fAfsCnwOeAE6tvM8PAz8D7gVOJnUxHAZ8oGAZfZI0GhgNzG8w+XPA9cBZwOPAm4B/B3aXtE1EPEZ6/x8lrf8/A5MbLOOo/B6vB04hrfs9gbMlbRYR/7G87wP4IXAi8D7ga71VknQIqevsz6TuuedJXVt7Aevm9/Np4DRgLPCZyux3V4bXAP4I/DUvd92CGD+Q2/4+8AiwD6k7aRPgiIL56/2JtJ18ibTe/5zL5/Uz30XAh4BrSNv2+qRE+zdJu0TErXX1i7bTESEi/OqyF7ArEMDn+qn3mVzvPXXlawIPAFMrZQJWa9DGkbmND1XKenLZS8CWDeY5L0//P3XlB+byjzdo66QGZQuBnroY7wDmVspWAh4ifQmMrpSvAczK7RxesE5PynUnkb4I1wd2Jv1CDeAbDeZ5TYOySbn+5+vKAzivQf0NSN0iP20w7TvAK8BmBfEHcFU/dZ4GHutn3f8y11upn7amArP7mBbA1xpM6+vzfgXYoe7zvjxP26m/ZffS9q69bQO91N8zl10CqFK+LemHyJ8Hsp2OlJe7pIa2Q4B/ADdLGlt7kfYmriHtOawGqT83Ip4HkLSipLVz3Sm5rbc1aP83EXF3g/KaM+rGa20t073Ti19FxOzaSKT/xuuA9SWtkYvfAryO9GX8RKXus6RffM36A+lX9FzgL8DbgdNJv1KXEhELYXF33lp5fd0OPEXj9dXIAcAo4NzqZ5Tb+jWpW3jSAN5HI0+Tfiz05SnSAfP3SdJyLu9bTda/JiJuqY3kz/sbefRfljOWUrXlnJKXX4tlOnAV6X9mXN08JdvpiOAuqaFtS9JZIAv6qDMWeBBA0oeAzwLbAyvX1RvdYN57+ln+rOpIRDyWv4PGNK7e9/zZY/nvGOBZYEIe/2eDuo3K+nMs6X2tDuxG6nIbHRHLnEkjaXdSt83bgFXrJjdaX41smf/+oY866xW21Z81WbbbsN6pwLtIZ3M9JumPwNXAJdHcCQ0LIuLJJuNr9OPjrvx30ybbGqgJwKu9xHIHqbtpAkv/T5VspyOCE8bQJuDvpGMHvVkAIOmDpN3wG4HjSEnkBWBF0nGQRnubz/W18Ih4pY+4SvQ2f7WN5f0VXO/GWHJc50pJ84DTJN0aEYv3WCS9Ffg9MJN0BtV9pP7+AC6m/ISRWvyHkvZqGmn0hdSUfCD+tcDf+qoXETMkbUXaq5lEOrj9A+DkfMzr3sJF9rlt9Lb45azXiu+rgWxPJdvpiOCEMbTNAMYBUyLi1X7qfpSUIHaLiMX/7JLeOIjxtcJ9+e8bGkxrVNasb5OO43xN0k8jovYL/V9JyXSviKjFgKTXUL53AekzAng0Ivray1he/57//qa/ihGxCPhtfiFp7zzf8Sw5y2ownqy2VR9l1aT5OKkrsl6jvZBm47wXeA9pz296L7HchzXkYxhD2wWkg7cN9zAkVbs6XiH9c61QmS7gy4MZYAtMI/0yPzyf0QRA7jte7tNSI+IlUjfNGFL3VE3tV2X9L8gv0fj/5llgnQbllwKLSL/gl7kaOx8bGdVs3HVtHAh8HniYdAZSX3XHNiiuHVeoxv8sMLoFxzmq9pS0QyUWkeKGpS94vAd4raQdK3VXYOkztqpxQuN130htOV+svjdJbyKdtfWXiOiri3dE8x5Gd5skqb7vHNKv1XNIZ9nsCXwz97dPIfVhb0zqbniB1E8PcBmwPzBF0gWkYxj7kfryu1ZEvKx0i5SLgBslnUs6m+VwUj/yBJb/1/CFpGMVx0v6bkQ8RTp75zOk04QnAy+S1vW2pOs26l0P7CHpC6Qz1CIiLo6IOZKOJp32erekC0lXII8DtiF9BluRrlTuz/h8WiykY1e1K713JHWdfbDguMLvJT1FOiX1QWBt0rqMvB6q7+f9wPck/Q8pgU6JiEanH5e6nbT9fZ/0I2BfYA/gwoiodqVNJh1ru1zSd0jr/gAaf1/dBTwDHCPpOeBJYH5ETGlQl4i4RtKlwEGkhHgVS06rfYGlfzRYvU6fpuXXsi+WnCrY2+sflborkTbym0in/y0kdYNcBLy7rt2Pkf7BXiD9w04m/TJb6pRQGpyOWNfOeeSTRRpM67etvtpnyemvPXXlHyJ1ISwifSF/hXTGy1KnBPexTmvtTuxl+sfz9K9UyvYDbs7r9FHSsYuNSV/uU+vm34J0zOPp2udUN31nUhKaT/oCfJh0ps1ngVUL4q/fBp4ldZ38Cvg3Gp8y3Wjdf4x0Bt0jOY65pK6p3ermfQ1wLul05tre6a552lR6P+W2z88bOLjyOT5IuoZn5Qbt7E26An9RXlenk7ogl9luct1bSNt11D6b3rYz0v/MF0gHvheRusB+BWzT33vpbzsd7i/lN2825Ej6LOnUzrdHxPWdjsdsuHPCsK4naRXglaiclZWPYUwnnUr6ulj2KnUzazEfw7ChYFPSjfUuJnXDbEC6NcgE4GgnC7P2cMKwoWAB6SDsR0j3LHqZdP3JCRFxaScDMxtJ3CVlZmZFhtUextixY6Onp6fTYZiZDSk333zzoxFRfw+tZQyrhNHT08O0adP6r2hmZotJur+knq/0NjOzIk4YZmZWxAnDzMyKOGGYmVkRJwwzMyvihGFmZkWcMMzMrIgThpmZFXHCMDOzIk4YWc/66yOpI6+e9dfv9Ns3M+vXsLo1yPK4f968QXnqfQnNm9ehJZuZlfMehpmZFXHCMDOzIk4YZmZWxAnDzMyKOGGYmVkRJwwzMyvihGFmZkWcMMzMrIgThpmZFWl7wpC0oqRbJV2VxydIukHSDEmXSFoll4/K4zPz9J52x2pmZkt0Yg/jOODuyvjpwBkRsQXwBHBkLj8SeCIiNgfOyPXMzKxD2powJG0IvA/4YR4XsDtwWa5yPrBfHt43j5OnT8r1zcysA9q9h3Em8Hng1Tw+BngyIl7O43OA8Xl4PPAgQJ7+VK5vZmYd0LaEIen9wPyIuLla3KBqFEyrtnuUpGmSpi1YsKAFkZqZWSPt3MPYGdhH0mzgYlJX1JnA2pJqt1nfEHg4D88BNgLI09cCHq9vNCImR8TEiJg4bty4wX0HZmYjWNsSRkR8MSI2jIge4CBgSkR8BLgOOCBXOwy4Ig9fmcfJ06dERKceWWFmNuJ1w3UYXwCOlzSTdIzi3Fx+LjAmlx8PnNCh+MzMjA49cS8ipgJT8/AsYMcGdV4ADmxrYGZm1qtu2MMwM7MhwAnDzMyKOGGYmVkRJwwzMyvihGFmZkWcMMzMrIgThpmZFXHCMDOzIk4YZmZWxAnDzMyKOGGYmVkRJwwzMyvihGFmZkWcMMzMrIgThpmZFXHCMDOzIk4YZmZWxAnDzMyKOGGYmVkRJwwzMyvihGFmZkWcMMzMrIgThpmZFXHCMDOzIk4YZmZWxAnDzMyKOGGYmVkRJwwzMyvihGFmZkWcMMzMrIgThpmZFXHCMDOzIk4YZmZWxAnDzMyKOGGYmVkRJwwzMyvihGFmZkWcMMzMrIgThpmZFWlbwpC0qqQbJd0u6U5JJ+fyCZJukDRD0iWSVsnlo/L4zDy9p12xmpnZstq5h7EI2D0i3gxsB7xX0k7A6cAZEbEF8ARwZK5/JPBERGwOnJHrmZlZh7QtYUTybB5dOb8C2B24LJefD+yXh/fN4+TpkySpTeGamVmdth7DkLSipNuA+cA1wL3AkxHxcq4yBxifh8cDDwLk6U8BY9oZr5mZLdHWhBERr0TEdsCGwI7Alo2q5b+N9iaivkDSUZKmSZq2YMGC1gVrZmZL6chZUhHxJDAV2AlYW9JKedKGwMN5eA6wEUCevhbweIO2JkfExIiYOG7cuMEO3cxsxGrnWVLjJK2dh1cD9gDuBq4DDsjVDgOuyMNX5nHy9CkRscwehpmZtcdK/VdpmQ2A8yWtSEpUl0bEVZLuAi6W9DXgVuDcXP9c4EJJM0l7Fge1MVYzM6vTtoQREdOB7RuUzyIdz6gvfwE4sA2hmZlZAV/pbWZmRZwwzMysiBOGmZkVccIwM7MiThhmZlbECcPMzIo4YZiZWREnDDMzK+KEYWZmRZwwzMysiBOGmZkVccIwM7MixQlD0rsqz62olq8k6V2tDcvMzLpNM3sY1wHrNChfK08zM7NhrJmEIRo8IpX0nO2FrQnHzMy6Vb/Pw5B0ZR4M4CeSFlUmrwi8CfifQYjNzMy6SMkDlB7LfwU8ATxfmfYi8BfgBy2Oy8zMuky/CSMijgCQNBv4VkS4+8nMbAQqfkRrRJw8mIGYmVl3K04YktYBTgEmAetSd8A8ItZsbWhmZtZNihMGcC6wPTAZeJjGZ0yZmdkw1UzCmATsGRE3DFYwZmbWvZq5DmM+8OxgBWJmZt2tmYRxIvBVSWsMVjBmZta9mumS+jLQA8yXdD/wUnViRGzbwrjMzKzLNJMwLhu0KMzMrOv5OgwzMyvi52GYmVmRZi7ce4Y+rr3whXtmZsNbM8cwPlk3vjLpQr79SVeAm5nZMNbMMYzzG5VLuoV0Ud93WxWUmZl1n1Ycw7gO+EAL2jEzsy7WioRxEPBoC9oxM7Mu1sxB77+z9EFvAeuRnvN9dIvjMjOzLrM8F+69CiwApkbEP1oXkpmZdSNfuGdmZkWa2cMAQNLuwFak7qk7I2Jqq4MyM7Pu08wxjPHA5cBbSA9QAnidpGnAv0TEw73ObGZmQ14zZ0mdBbwCbB4RG0XERsAWueyswQjOzMy6RzNdUnsCu0bEfbWCiJgl6VPAtS2PzMzMukorrsN4taSSpI0kXSfpbkl3Sjoul68j6RpJM/Lf0blcks6SNFPSdEk7tCBWMzMboGYSxrXAWZI2qhVI2hj4DmV7GC8Dn42ILYGdgGMlbQWcAFwbEVvkdk7I9fcidXltARwFnN1ErGZm1mLNJIxPAasDsyTdL2k2cG8u+1R/M0fE3Ii4JQ8/A9wNjAf2BWr3qTof2C8P7wtcEMn1wNqSNmgiXjMza6FmrsN4ENhB0p7AG0lXet8VEX9odqGSekh3ur0BWC8i5uZlzJW0bq42HniwMtucXDa3rq2jSHsgbLzxxs2GYmZmhfrdw5C0l6TZktYCiIhrIuK7EXEWcFOe9u7SBUpaA/gF8OmIeLqvqg3KlnkeR0RMjoiJETFx3LhxpWGYmVmTSrqkPgl8MyKeqp+Qy04HjitZmKSVScniooj4ZS6eV+tqyn/n5/I5wEaV2TdkyfUfZmbWZiUJY1ugr26nKcCb+2tEkoBzgbsj4n9XJl0JHJaHDwOuqJQfms+W2gl4qtZ1ZWZm7VdyDGMcfZ86G8CYgnZ2Bj4K/F3SbbnsS8DXgUslHQk8AByYp/0W2BuYCTwHHFGwDDMzGyQlCWMOaS9jRi/TtwUe6q+RiPgLjY9LQHpiX339AI4tiM/MzNqgpEvqN8B/S1qtfoKk1YGv5jpmZjaMlexhnAIcAMyQ9F2g9uyLLUkHxAWcOjjhmZlZt+g3YUTEfEnvIF1pfSpLupUC+B1wTETMG7wQzcysGxRduBcR9wN75/s8bU5KGjMi4onBDM7MzLpHUw9QygnipkGKxczMulgr7lZrZmYjgBOGmZkVccIwM7MiThhmZlbECcPMzIo4YZiZWREnDDMzK+KEYWZmRZwwzMysSFNXetvgGAWk50u11ybrrcfsRx5p+3LNbGhywugCi2jwsPI20DzfM9LMyrlLyszMijhhmJlZEScMMzMr4oRhZmZFnDDMzKyIE4aZmRVxwjAzsyJOGGZmVsQJw8zMijhhmJlZEScMMzMr4oRhZmZFnDDMzKyIE4aZmRVxwjAzsyJOGGZmVsQJw8zMijhhmJlZEScMMzMr4oRhZmZFnDDMzKyIE4aZmRVxwjAzsyJtSxiSfiRpvqQ7KmXrSLpG0oz8d3Qul6SzJM2UNF3SDu2K08zMGmvnHsZ5wHvryk4Aro2ILYBr8zjAXsAW+XUUcHabYjQzs160LWFExJ+Ax+uK9wXOz8PnA/tVyi+I5HpgbUkbtCdSMzNrpNPHMNaLiLkA+e+6uXw88GCl3pxctgxJR0maJmnaggULBjXY4WYUIKkjr5711+/02zezJnU6YfRGDcqiUcWImBwREyNi4rhx4wY5rOFlEWmlduJ1/7x5bXiHZtZKnU4Y82pdTfnv/Fw+B9ioUm9D4OE2x2ZmZhWdThhXAofl4cOAKyrlh+azpXYCnqp1XZmZWWes1K4FSfoZsCswVtIc4CvA14FLJR0JPAAcmKv/FtgbmAk8BxzRrjjNzKyxtiWMiDi4l0mTGtQN4NjBjcjMzJrR6S4pMzMbIpwwzMysiBOGmZkVccIwM7MiThhmZlbECcPMzIo4YZiZWZG2XYdhVlW78WG7bbLeesx+5JG2L9dsOHDCsI6o3fiw3eSbHpoNmLukzMysiBOGmZkVccIwM7MiThhmZlbECcPMzIr4LCkbUXw6r9nAOWHYiOLTec0Gzl1SZmZWxAnDzMyKOGGYmVkRH8Mwa4NOHWwHH3C31nHCMGuDTh1sBx9wt9Zxl5SZmRVxwjAzsyLukjIb5nyxorWKE4bZMOeLFa1VnDDMbFD4zLDhxwnDzAaFzwwbfpwwzGzY8XGbweGEYWbDTqf2bladN29Yd8M5YZiZtchw74bzdRhmZlbECcPMzIo4YZiZWREnDDMzK+KEYWZmRZwwzMysiBOGmZkVccIwM7MiThhmZlakqxOGpPdK+qekmZJO6HQ8ZmYjWdcmDEkrAt8H9gK2Ag6WtFVnozIzG7m6NmEAOwIzI2JWRLwIXAzs2+GYzMxGrG6++eB44MHK+BzgbfWVJB0FHJVHn5X0z17aGws82tcCO3OPycXL7je+QVpuiUGJrUXru+nY2vg5LxVbh7evem3Z3gb4nlsS2yCt727+HhkraaDrbZOSSt2cMBqt92VuBBkRk4HJ/TYmTYuIia0IbDB0c3yObWAc28A4toFpR2zd3CU1B9ioMr4h8HCHYjEzG/G6OWHcBGwhaYKkVYCDgCs7HJOZ2YjVtV1SEfGypE8CvwNWBH4UEXcuR5P9dlt1WDfH59gGxrENjGMbmEGPTRGdej6UmZkNJd3cJWVmZl3ECcPMzIoM2YTR321DJI2SdEmefoOknsq0L+byf0p6T2mbgx2bpD0l3Szp7/nv7pV5puY2b8uvddscW4+k5yvLP6cyz1tyzDMlnSVpQKeiL0dsH6nEdZukVyVtl6e1a729S9Itkl6WdEDdtMMkzcivwyrl7VpvDWOTtJ2kv0m6U9J0SR+uTDtP0n2V9bZdO2PL016pLP/KSvmE/PnPyNvDKu2MTdJuddvbC5L2y9Patd6Ol3RX/tyulbRJZdrgbW8RMeRepIPg9wKbAqsAtwNb1dU5BjgnDx8EXJKHt8r1RwETcjsrlrTZhti2B16Xh98EPFSZZyowsYPrrQe4o5d2bwTeTrp25mpgr3bGVldnG2BWB9ZbD7AtcAFwQKV8HWBW/js6D49u83rrLbbXA1vk4dcBc4G18/h51brtXm952rO9tHspcFAePgc4ut2x1X2+jwOrt3m97VZZ5tEs+T8d1O1tqO5hlNw2ZF/g/Dx8GTApZ9R9gYsjYlFE3AfMzO216lYkA44tIm6NiNq1JncCq0oaNYAYWh5bbw1K2gBYMyL+FmmrvADYr4OxHQz8bADLX67YImJ2REwHXq2b9z3ANRHxeEQ8AVwDvLed66232CLinoiYkYcfBuYD4wYQQ8tj603+vHcnff6Qtoe2rrc6BwBXR8RzA4hheWK7rrLM60nXqcEgb29DNWE0um3I+N7qRMTLwFPAmD7mLWlzsGOr2h+4NSIWVcp+nHdz/3OA3RfLG9sESbdK+qOkXSr15/TTZjtiq/kwyyaMdqy3Zudt53rrl6QdSb9m760Un5K7PM4Y4A+X5Y1tVUnTJF1f6/Ihfd5P5s9/IG22Kraag1h2e2v3ejuStMfQ17wt2d6GasIouW1Ib3WaLW/W8sSWJkpbA6cDH69M/0hEbAPskl8fbXNsc4GNI2J74Hjgp5LWLGxzsGNLE6W3Ac9FxB2V6e1ab83O28711ncD6dfnhcAREVH7Nf1F4I3AW0ndG1/oQGwbR7rVxb8CZ0rarAVttiq22nrbhnStWE1b15ukQ4CJwDf7mbcl622oJoyS24YsriNpJWAtUl9jb/O26lYkyxMbkjYELgcOjYjFv/Yi4qH89xngp6Td1rbFlrvwHssx3Ez6Jfr6XH/DyvwdWW/ZMr/22rjemp23neutVznp/wb4ckRcXyuPiLmRLAJ+TPvXW62bjIiYRToWtT3pxn9r58+/6TZbFVv2IeDyiHipEnPb1pukPYATgX0qPRGDu70tz8GZTr1IV6jPIh20rh0U2rquzrEsfYD00jy8NUsf9J5FOsjUb5ttiG3tXH//Bm2OzcMrk/pvP9Hm2MYBK+bhTYGHgHXy+E3ATiw5mLZ3O2PL4yuQ/ik27cR6q9Q9j2UPet9HOgA5Og+3db31EdsqwLXApxvU3SD/FXAm8PU2xzYaGJWHxwIzyAd+gZ+z9EHvY9oZW6X8emC3Tqw3UvK8l3zSQtu2t2Zn6JYXsDdwT15pJ+ayr5KyLcCqecOaSTo7oPpFcmKe759UzhRo1GY7YwO+DCwEbqu81gVeA9wMTCcdDP8O+cu7jbHtn5d9O3AL8IFKmxOBO3Kb3yPfQaDNn+muwPV17bVzvb2VlLAWAo8Bd1bm/bcc80xSt0+711vD2IBDgJfqtrft8rQpwN9zfD8B1mhzbO/Iy789/z2y0uam+fOfmbeHUR34THtIP5pWqGuzXevtD8C8yud2ZTu2N98axMzMigzVYxhmZtZmThhmZlbECcPMzIo4YZiZWREnDDMzK+KEYdYCkqL+bqtmw40Thg0J+bbRV3U6jj5sAPx6sBeidLv2yK8XJd0r6bRm71kk6SRJd/Rf02yJrn2mt1mnSVol0t1C+xURjwx2PBU/Br5Eugr4rXkc0n2MzAaN9zBsWJC0lqTJkuZLeibfUXdiZfoYST+TNEfpQVB3Sjqiro2pks6W9C1JC4C/5vKQdJSkn0taKGlWvulbdd7FXVJKD5sKSftLukbSc/lhN3vWzfO+/JCcFyT9SdJBeb6eft7ucxHxSEQ8EBG/IN3C+t11bX89t/28pNmSviFp1TztcOArwNaVvZXDS9ajjWxOGDbk5VuW/4Z0u+b3k+6z8ydgSr6jKKTbitySp29Nuk3I/5U0qa65Q0j32tkFOLRS/l/AFcCbgUuAH6nylLNenAKclee5CbhY0ho55o2BX+a435zrfaOpN57aeTOwM+kWH1ULSbeI2JL04KmDSLfEIcf/bdKtcTbIr0sK16ONZAO5z4lffrX7RboB3FW9TNsdeBZYra78NuDzfbR5MfDDyvhUYHqDegGcVhlfCXgOOKSuzgF5uCePf7wyfXwue2cePw24m8r9fEjdTAH09BHzVODF/H4X5fqvUHfDygbzfYL0UJ7a+EnUPUFxoOvRr5Hz8jEMGw7eAqwOLKh7PtKqwGYAklYETiA9YGk86W7Fq5C+gKtu7mUZ02sDEfFy7rLq7/ng0yvDtVtJ1+Z5I3BTRFRv5nZDP+3VXAKcDKxJet7CE5G6phbL3WOfBjYH1mDJY4j70u96tJHNCcOGgxVId+7cpcG0p/PfzwGfBY4j3U30WeBUlv3SX9jLMuq7fIL+u3Srz0mI/CVcm0cM7ME/AE9FxExY/ACdOyUdHhHn5bKdSHtPJwOfAZ4E9gG+1U+7JevRRjAnDBsObgHWA16N9LCdRt4J/DoiLoTFxz1eT/oy7YS7WfaZ5U0/bCciXpJ0KnCapEsjPed5Z+ChiPjvWr0Gx1teZNk9jpL1aCOYD3rbULKmpO3qXj2kZwP8FbhC0l6SJkh6u6STteTZ4/cAkyS9U9IbSc8DmNCRd5GcA2yWz8h6g6QPsuSRvM3uefwE9fw0AAAA8UlEQVQ0z/PJPH4PMF7SRyRtKulo4OC6eWYDm0jaQdLYfB1HyXq0EcwJw4aSXYBb617fyscB9iY9vOYHpLN/LgXewJJjB18jPXTnatKZPwuBi9oZfFVE3E96KNU+pIcEfYbUhQTwQpNtvUhKgJ+X9NqI+DXpGc9nko6j7Ek6y6vqF8BvSU/cWwAcXLgebQTzA5TMuoSk40hPVRsdEa92Oh6zej6GYdYhko4lXZ+xgPSs5f8EznOysG7lhGHWOZuTrr0YQ3p29DmkPQyzruQuKTMzK+KD3mZmVsQJw8zMijhhmJlZEScMMzMr4oRhZmZF/j9/UBl/cq724QAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "plt.hist(param_grid['learning_rate'], color = 'r', edgecolor = 'k');\n",
    "plt.xlabel('Learning Rate', size = 14); plt.ylabel('Count', size = 14); plt.title('Learning Rate Distribution', size = 18);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Smaller values of the learning rate are more common with the values between 0.005 and 0.2. The width of the domain is fairly large indicating a large amount of uncertainty on our part about the optimal value (which we hope is somewhere in the grid)! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAEeCAYAAACOtbLLAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3Xm4HFWd//H3B6JA2JcAsoTAT0QRQTAKjOwQZXEQhd8Yh90lPiqCCCIMoOAoiOIIiKJBMAgMIIgYwYVdRCESFtnCToAIJBcRCFtY8p0/zmmoqnTf23etvrmf1/P0c2+fOl31PVXV9a06VV2liMDMzKxhkboDMDOzzuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODAspSSFpSt1x9IWk0ZJOkfSopNclzaw7ppGijvWm2TTrWn+H8/dmIDkx9IKkbfKKE5I+06JOSLp0qGNbyHwN+BJwAbAf8OXuKnueL0jSzMK6GpKez4n2d5IOlLTcAE9vN0nHDOQ4B4Ok5SQdI2mbumPpZKPqDmAYO1bSuRHxUt2BLIQmAHdExFfrDmSYmwUckf9fHFgN2AY4GThS0icj4urKZ5YAXu/DtHYD9gWO6cNn+zrNvlgO+Eb+/9qaY+lYPmLom+mkL1m3e7IjhaRFJY0ewFGuCjw9gOMbqZ6NiHPy62cR8c2I2I6UHBYHfiPp7cUPRMTLEfHqYAcmaQlJo4Zymu3opFjq5MTQN78Ebga+JmnFniq36reUtF8etk2h7Jhctr6kkyQ9IekFSVdJWi/X+bikWyS9lLsMJnUz7R0k3SjpRUlPSjpZ0pJN6i0r6QRJD0iaJ6lL0nmS1mkR8w6Sjpb0IPAy8B89zINRkr4m6W5JL0v6p6RfS3pPddzA2sDWhW6QY7obd29I+oSk6yXNzfNkmqQ9WtSbmrtf5kl6StIlkjas1JsmaXZjI1cZ9uEc/5cLZZL0eUk35+nPlXSNpG2bfH4fSX+T9ExeBx6SdK6kMf2ZBxHxJ+AQYCng8Mo0m/X37yLpT3kevJTnycWS3pGHX0s6Wmh8vvHaL5dNye/HSDpT0mzgBWCNVtMsTLvH9bcx/haff2Pc+Xv2cB70jUKcM7trfy7/TOE796ykyyVt0Wp6kjbP8+yFPN9+JmmpZjF2IieGvglSP/iywJGDNI2zgI2A44DvA5sBf5S0N/Aj4BLgq8C/gJ82W0mBTXK9G4BDgT8DBwJTJb2x7CUtC/wV+AJwGal//1RgO2CapLWajPtEYCJwOnAQcG8P7TkX+A6pe+OrwE+AbYEbJG2c61wH7A08BdyT/98buLiHcbdF0reA84G5wNGkjeKLwIWSvlipfgBpOU8Gvkhq55bAXyStW6h3FrAysGOTSe4DvAb8b6HsbNK8fQA4jNT1sixwhaRdC7Hulcf9MvB10tHpucB6eXr9dTYwD9i5u0qStgam5hiPJ82X04EVgcbRxrdJ6xa8ucz2Ji3PoitIR9r/Terier6HGNtaf3thBnBw/v/XhTh7Ood1AqnNrwL/Rfo+rg9cI6nZ/HsvcClwE/AVUrs/DfxPH2KuR0T41eaLdAgewKH5/eWkL+5ahToBXFr5XABTmoxvvzxsm0LZMbnst4AK5Qfm8rnA2EL5mBzDeU2mGcBulfKTc/nEStlLwEaVumsBzxVjL8R8LzC6zfk2IX/mgkqbNiRtOP9cqT8TuLYXy2WBed6kzia53nFNhl2S27l0oWzJJvXeRdqY/rhQtkIu+2Wl7tKkveKphbKP5RgmVeqOInVPPtyYP6Rk+Bwwqo/r6kzgzh7q3J7jKba7tK6SNmYBrNzDuKakzUnrYcA53Sy/KU3K2l1/u5t2tT3jctkxbdZfD5gPXA+8tVC+GvBMns+LVj4/H9isMt7LSIllqb4sz6F++Yihf74GvJW0BzTQTom8RmWNPbLfRMSjjcKI6CJtpIt7sQ33RsQllbLv5L8fg9S1AexJ2rv7h6SVGi/Shu1G4ENNxn1aRLzYZls+lv9+u9imiLidtGe1RX+7R9qwJ+lLe1axjbmdU0kb8s0Lsb0Ab3T9LJPrNeb1poV6T5OS+K4qX+mzBzCatNffsBcpsV9Smf5yeRzjeHM5Pps/v0teRoPhufx3mW7qPJv/7t6su6yXTuxl/R7X3yHwUUDAdyPilUZhRDxOSkhrARtXPnNDRNxYKbuatAMwbtAiHUBODP0QEbcC5wF7VvueB8BDlff/yn8frlbMw5qd65hRLYiIJ0h7Oo1zB2PyZz9E2vBVXxOAVZqM+77uwy9Zm7QXtUA8wJ2FOoPpXaQv+D0s2MYzcp032ilpY6VLYOeSNo6Nuu8Blq+M+xfAYpTPs+xDWi7Fy2jfRUpAs5vEcEwlhuOAR0hHM12SfpX7uZfufdNbaiSE57qpcypwK/Bj4Gm9eblrXxJ5b9YZaG/9HWyN9fKuJsMa6241lup3F+Cf+W+P5yQ7gS9X7b+jSHuHJwA79fKz3c3/VpfMtSpvtlfZ6mEbavL/laQ2tKvdo4Xq9Ooi0vzYidbz8C4ASWNJR1DPkY4G7yUdPQVwEumkbdHvSBv3fYDJ+fNbAz+JiHmVGLqA/+wmzjsBIuJ+SesD2+fX1qR+7mMlbRURD7bR5pYkLQa8A3giIua2qhcR/5T0ftL5lQnAVsAPchw7R8QN7U6zF0eYb3ykRXl1fWp14nkgtm99WXe7u9y1E74LPXJi6KeIeFjSacBBza4syZ4m9UVXDfZez/rVAklvI51IbOzVdJH2wJaJiCsHKY4HgQ+T9phvbxFjsyOhgXQ/6QTxoxHR7Mil6GOkjf+uEXFNcYDSVWjFjT0R8Zqk/yWtA+sAnyRtAIrdSI0Y3gHcGBE9nXglJ5Xf5Rf5ROdlpBOa1ZPlvbU36SjnsjbieJ10zf+1OY4NSVflHQXs0qjWz3iaaWf9hXxps6QVctdeQ7PvV2/jbCTgdxf+r8bX7AhhWHNX0sD4FmnvstUe933A5ipc6y9peWD/QY5rPUm7Vcq+lv9eAhAR80lXu3xATS7bBJDU36tgGv3ERxT7yyVtAOwKXJ/PlQyms/Pf4yQtWh1YaWNjj0+VOp8l/caimUYS2Ie00b03IqZV6vyC9J07vtkIJBW7slZqUuWW/LfZTkbb8pVG3yd1kzWNpYc47iFdrFCM4/lcv1+xVfS4/maNLqodKnUPaTLORkJuN86ppGTyVUlvaRTmBLU/qbvv1jbHNWz4iGEARMRTkr5H65PQpwLnAFdLOpt0svGzpJWq1YZmINwBnCPpdNLe6rakbq8/ka4QajgS+CDwS0m/JJ1wfoV0Ym1n0t7hfn0NIiKuyOOdCCyf++5XJe31vky64qq/3i7pqBbDfhARN0n6BnAscJukC4HHgbcB7yO18625/u9JXWVnSzqVdK7gg7nOgzT53kTErZLuIF0OuQzpssZqnYsk/Rw4QNImpPMPT5Gu59+cdPlnYy/3cknPkrq0HiOtM/uRNlJn055l82WvkI4OViOtA9sAc0hX9vS0t3u6pDVIV+A9Qvpl8CdI50p+Uah3I+lS1h9LalyBMy0i+nMk2O76ex7pnMxkSe8k9efvBCyQ1HLX2APARKXf4MwGXoiI3zYLICLuzd/tw4DrJF1Aavsk0lHlnvmIauFS92VRw+lF5XLVyrDRpA1N00snSdfuP0LqhpgBfIruL1cdV/n8OFpcZkc6xJ9ZKQvSVRM7ANNIe3izgR9SuDyxEv/RpC/jS6S9yRmkfu1NC/UWiLnNeTeKtLc3I8+Dp0l7fe9pUncmvb9ctbvXqoW6uwB/zNOfR9ro/h74fGWcW5EuUZxL6mq7DNig2bwufOaQPL3XgTW7iXdv0lVmz5ES40zS5amfKNT5LOn69ydJSfoJUpfStm3Ok5mVefBioa0HAst1My+nFN5/nLTXPCvPry7Shnn3yucWIV11NCu3P4D98rAptLictNk0+7j+bgr8Jc/Pp0i/P1muxbg/kOs2zhvN7C6WwvK4NY//ubxstmynLf353tT1alwzbWZmBvgcg5mZVTgxmJlZiRODmZmVODGYmVnJsLxcdaWVVopx48bVHYaZ2bBy8803PxURPd7OZFgmhnHjxjF9+vS6wzAzG1YkPdJOPXclmZlZiRODmZmVODGYmVmJE4OZmZU4MZiZWYkTg5mZlQxZYpB0pqQ5ku4slH1P0j2Sbpf068ozc83MrAZDecQwhfQEraIrgA0iYkPSwzaOGMJ4zMysiSFLDBFxHfkRfIWyyyPitfz2RtIDS8zMrEad9MvnT1F+KlOJpEmkpyYxduzYPk9k7KpjeWz2Y33+fH8svsjivDz/ZU93IZ622zwypl1nm9dcZU0effLRQZ1GRyQGSUcCr5GePdxUREwmPZWJ8ePH9/npQo/NfoxruKbnioNg2/nb1jLtkTbdOqftNo+Madfa5tnbDvo0ak8MkvYFPgJsH36cnJlZ7WpNDJJ2JD0HeOuIeLHOWMzMLBnKy1XPA24A1pM0S9KngVOBpYErJN0m6SdDFY+ZmTU3ZEcMEfHJJsVnDNX0zcysPf7ls5mZlTgxmJlZiRODmZmVODGYmVmJE4OZmZU4MZiZWYkTg5mZlTgxmJlZiRODmZmVODGYmVmJE4OZmZU4MZiZWYkTg5mZlTgxmJlZiRODmZmVODGYmVmJE4OZmZU4MZiZWYkTg5mZlTgxmJlZiRODmZmVODGYmVmJE4OZmZU4MZiZWYkTg5mZlQxZYpB0pqQ5ku4slK0g6QpJ9+e/yw9VPGZm1txQHjFMAXaslB0OXBUR6wJX5fdmZlajIUsMEXEd8HSl+KPAWfn/s4DdhioeMzNrru5zDKtExBMA+e/KrSpKmiRpuqTpXV1dQxagmdlIU3diaFtETI6I8RExfsyYMXWHY2a20Ko7McyW9DaA/HdOzfGYmY14dSeGqcC++f99gd/UGIuZmTG0l6ueB9wArCdplqRPA98BJki6H5iQ35uZWY1GDdWEIuKTLQZtP1QxmJlZz+ruSjIzsw7jxGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiUdkRgkHSzpLkl3SjpP0uJ1x2RmNlLVnhgkrQ4cCIyPiA2ARYGJ9UZlZjZy1Z4YslHAEpJGAaOBx2uOx8xsxKo9MUTEP4ATgUeBJ4BnI+Lyaj1JkyRNlzS9q6trqMM0Mxsxak8MkpYHPgqsDawGLClpr2q9iJgcEeMjYvyYMWOGOkwzsxGj9sQA7AA8HBFdEfEqcDHwbzXHZGY2YnVCYngU2EzSaEkCtgdm1ByTmdmIVXtiiIhpwEXALcAdpJgm1xqUmdkINqruAAAi4hvAN+qOw8zMOuCIwczMOosTg5mZlTgxmJlZiRODmZmVODGYmVlJ24lB0lb5XkbV8lGSthrYsMzMrC69OWK4BlihSfmyeZiZmS0EepMYBEST8hWBFwYmHDMzq1uPP3CTNDX/G8A5kuYVBi8KbAD8dRBiMzOzGrTzy+d/5r8C/gW8VBj2CnA9cPoAx2VmZjXpMTFExP4AkmYCJ0aEu43MzBZibd8rKSKOHcxAzMysM7SdGCStAHybdFvslamcuI6IZQY2NDMzq0Nv7q56BrAx6ZbYj9P8CiUzMxvmepMYtgcm5OcnmJnZQqo3v2OYAzw/WIGYmVln6E1iOBL4pqSlBisYMzOrX2+6ko4CxgFzJD0CvFocGBEbDmBcZmZWk94khosGLQozM+sY/h2DmZmV+HkMZmZW0psfuM2lm98u+AduZmYLh96cYzig8v4tpB+87U76RbSZmS0EenOO4axm5ZJuIf347YcDFZSZmdVnIM4xXAP8+wCMx8zMOsBAJIaJwFP9GYGk5SRdJOkeSTMkbT4AcZmZWR/05uTzHZRPPgtYhfQc6M/3M46TgT9ExB6S3gqM7uf4zMysj/rzA7f5QBdwbUTc09cAJC0DbAXsBxARr5CeDGdmZjXohB+4rUNKMD+XtBFwM3BQ9UlxkiYBkwDGjh07SKGYmVmvzzFI2k7SAZK+KGmbAYhhFLAJcFpEbAy8ABxerRQRkyNifESMHzNmzABM1szMmunNOYbVgV8D7yM9qAdgNUnTgY9FxOMtP9y9WcCswnMeLqJJYjAzs6HRmyOGU4DXgbdHxJoRsSawbi47pa8BRMSTwGOS1stF2wN393V8ZmbWP705+TwB2CYiHm4URMRDkg4ErupnHF8Czs1XJD0E7N/P8ZmZWR/1JjG0Mr+/I4iI24DxAxCLmZn1U2+6kq4CTpG0ZqNA0ljSbxD6e8RgZmYdojeJ4UDSD88ekvSIpJnAg7nswEGIzczMatCb3zE8BmwiaQLwTtIvn++OiCsHKzgzMxt6PR4xSNpJ0kxJywJExBUR8cOIOAW4KQ/70KBHamZmQ6KdrqQDgO9FxLPVAbnsBOCggQ7MzMzq0U5i2BDorrvoamCjgQnHzMzq1k5iGEP3l6QGsOLAhGNmZnVrJzHMIh01tLIh8I+BCcfMzOrWTmK4DPhvSUtUB0gaDXwz1zEzs4VAO5erfhvYA7hf0g+BxrMX3kU6MS3guMEJz8zMhlqPiSEi5kj6N+A0UgJQYxDwR+ALETF78EI0M7Oh1NYP3CLiEWBnScsDbyclh/sj4l+DGZyZmQ29Xt1ELyeCmwYpFjMz6wC9foKbmZkt3JwYzMysxInBzMxKnBjMzKzEicHMzEqcGMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzEicHMzEqcGMzMrMSJwczMSjomMUhaVNKtki6tOxYzs5GsYxIDcBAwo+4gzMxGuo5IDJLWAHYBflZ3LGZmI11HJAbgJOAwYH6rCpImSZouaXpXV9fQRWZmNsLUnhgkfQSYExE3d1cvIiZHxPiIGD9mzJghis7MbOSpPTEAHwR2lTQTOB/YTtI59YZkZjZy1Z4YIuKIiFgjIsYBE4GrI2KvmsMyMxuxak8MZmbWWUbVHUBRRFwLXFtzGGZmI5qPGMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzEicHMzEqcGMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzEicHMzEqcGMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzEicHMzEqcGMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzEicHMzEqcGMzMrKT2xCBpTUnXSJoh6S5JB9Udk5nZSDaq7gCA14BDIuIWSUsDN0u6IiLurjswM7ORqPYjhoh4IiJuyf/PBWYAq9cblZnZyFV7YiiSNA7YGJjWZNgkSdMlTe/q6hrq0MzMRoyOSQySlgJ+BXw5Ip6rDo+IyRExPiLGjxkzZugDNDMbIToiMUh6CykpnBsRF9cdj5nZSFZ7YpAk4AxgRkT8T93xmJmNdLUnBuCDwN7AdpJuy6+d6w7KzGykqv1y1Yi4HlDdcZiZWdIJRwxmZtZBnBjMzKzEicHMzEqcGMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzEicHMzEqcGMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzEicHMzEqcGMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzEicHMzEqcGMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKykIxKDpB0l3SvpAUmH1x2PmdlIVntikLQo8CNgJ2B94JOS1q83KjOzkav2xAB8AHggIh6KiFeA84GP1hyTmdmIpYioNwBpD2DHiPhMfr83sGlEHFCpNwmYlN+uB9w7pIH23krAU3UHMUDcls7ktnSuTm3PWhExpqdKo4Yikh6oSdkC2SoiJgOTBz+cgSFpekSMrzuOgeC2dCa3pXMN9/Z0QlfSLGDNwvs1gMdrisXMbMTrhMRwE7CupLUlvRWYCEytOSYzsxGr9q6kiHhN0gHAH4FFgTMj4q6awxoIw6bbqw1uS2dyWzrXsG5P7Sefzcyss3RCV5KZmXUQJwYzMytxYhggkhaVdKukS/P7tSVNk3S/pAvyifWOJ2k5SRdJukfSDEmbS1pB0hW5LVdIWr7uONsh6WBJd0m6U9J5khYfTstF0pmS5ki6s1DWdFkoOSXfVuZ2SZvUF/mCWrTle3k9u13SryUtVxh2RG7LvZI+XE/UzTVrS2HYoZJC0kr5fUcvl1acGAbOQcCMwvsTgB9ExLrAv4BP1xJV750M/CEi3glsRGrT4cBVuS1X5fcdTdLqwIHA+IjYgHRhw0SG13KZAuxYKWu1LHYC1s2vScBpQxRju6awYFuuADaIiA2B+4AjAPItcSYC786f+XG+dU6nmMKCbUHSmsAE4NFCcacvl6acGAaApDWAXYCf5fcCtgMuylXOAnarJ7r2SVoG2Ao4AyAiXomIZ0i3KDkrVxsWbclGAUtIGgWMBp5gGC2XiLgOeLpS3GpZfBT4RSQ3AstJetvQRNqzZm2JiMsj4rX89kbSb5ggteX8iJgXEQ8DD5BundMRWiwXgB8Ah1H+gW5HL5dWnBgGxkmkFWJ+fr8i8ExhpZ8FrF5HYL20DtAF/Dx3i/1M0pLAKhHxBED+u3KdQbYjIv4BnEjae3sCeBa4meG5XIpaLYvVgccK9YZb2z4F/D7/P+zaImlX4B8R8ffKoGHXFnBi6DdJHwHmRMTNxeImVYfDdcGjgE2A0yJiY+AFhkG3UTO57/2jwNrAasCSpMP6quGwXNoxXNc5JB0JvAac2yhqUq1j2yJpNHAk8PVmg5uUdWxbGpwY+u+DwK6SZpLuDLsd6QhiudyFAcPnNh+zgFkRMS2/v4iUKGY3Dn/z3zk1xdcbOwAPR0RXRLwKXAz8G8NzuRS1WhbD8tYykvYFPgLsGW/+qGq4teX/kXZA/p63A2sAt0haleHXFsCJod8i4oiIWCMixpFOmF0dEXsC1wB75Gr7Ar+pKcS2RcSTwGOS1stF2wN3k25Rsm8uGxZtIXUhbSZpdD7n02jLsFsuFa2WxVRgn3wVzGbAs40up04laUfga8CuEfFiYdBUYKKkxSStTTpx+7c6YmxHRNwREStHxLi8HZgFbJK/T8NuuQAQEX4N0AvYBrg0/78OaWV+ALgQWKzu+Npsw3uB6cDtwCXA8qRzJlcB9+e/K9QdZ5ttORa4B7gTOBtYbDgtF+A80vmRV0kbm0+3WhakLosfAQ8Cd5Cuxqq9DT205QFS//tt+fWTQv0jc1vuBXaqO/6e2lIZPhNYaTgsl1Yv3xLDzMxK3JVkZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MNqTynSf36Lnm8CfpWkmn1h1HkaRJkh6VNF/SMXXHY53JiWEhI2lK49bfHeptwG8HeyJ5oxyS9qqU7yfp+cGefifKtwn5EfA90v16TmxRb6akQ4cyNussTgzWb715pkFEPBkR8wYznoKXgW9JWmyIpjckJL2ljx9di3Q/rEsj4omIGJEJ0nrmxDDCSFpW0uT8oJG5kv4kaXxh+Ir5oTazJL2k9KCb/SvjuFbSaZJOlNQF/CWXR+6quFDSC5IearLH/kZXkqRx+f3uSg+deVHS3ZImVD6zS35gy8uSrpM0MX9uXA/NvQBYHPhiN/NjgSMISdtUHrayn6TnJe2k9GCZFyVNzfNyD6WH5jwr6WxJS1QmMUrSyZL+lV/fk7RIYVpvlXRCnt8vSLpJhQfTFGLZWdLfJL0CNH1wjaSxSg+8mZtfFyvdEh5J+wG35qoPtTn/Ws2z9SVdlqcxJ68vqxaGv1/S5ZKekvScpOslbV4Yfp6kX1XGuYikxyQdnN9L0mGSHszr4R1N1qWvS3pE0jxJT0r6RV/aYwtyYhhBJAm4jNSN8BFgY+A64Gq9eY/4xYFb8vB3kx7c81NJ21dGtxfp5/5bAvsUyr9Oun/PRqQN85mS1uohtG8Dp+TP3AScL2mpHPNY0g3wLsvDTwG+22aTnwe+CRypwtPB+mgx4BBgT9J9l8aTbjK4L7A76bkIHwG+UPncnqTv2ebA50gPa/lyYfjPga2B/wTeQ3rGwm8lbVQZzwnAUcA7gWmVYY1lewmwCulGjtuS7ip7SR52AW8+XOYDpC69x6rj6UleT64j3WbkA6SbFS4FTC0kvKVJtyDZMte5DfhdI9EC5wC7VJbJ1jmm8/L7b5Fum/FFYH3geNJ6uEuOY3fgUNL8Xpc07zv2fkrDTt335PBrYF+kp0td2mLYdqSN5RKV8tuAw7oZ5/nAzwrvrwVub1IvgOML70cBLwJ7Verskf8fl99/rjB89Vy2RX5/POkpcirU+a9cZ1w3MV8LnJpjuA/4Ti7fD3i+UK/0Ppdtk8e/UqFOAOsV6pwIvN6o02ze5xjuq8R+FOkOtpDuyjkfGFuZ/iXAjyux7N7Dcp+Q4xlXKFsnj3+H/H58T/Mt15sJHNpi2DdJT5Arli2fx/uBFp8R6d5CexXWizkU7jFEesjVH/P/SwIvAVtWxnMS8Lv8/1dI91F6S93fuYXx5SOGkeV9pCeZdeWukedzN8oGpI1U49nVRyo9n/afefjHgbGVcd1Mc7c3/on0QJwuen6wz+2F/xu3JG585p3ATZG3BtkCe8yt5BiOBA5sdKv00byIuLfwfjbwZEQ8VSmrtvXGSuw3AKsrPS1vE9JG8+7K8tiFvDwKpvcQ37uAxyNiZqMgIh4izc/1e/hsb7wP2KoSb+PIo7EOrSzpp5Luk/QsMJc0X8bmuF4jHcHsmesvRjrqOiePZ33SkesfKtP5PG/OlwtznYclnSHp/2shO5dUp1E9V7GFyCKkjdeWTYY9l/8eSuoyOYh0N8jngeNYcIP3QotpvFp5H/TcZfnGZyIiUs/HG58R/XywSURcqHSVzbHAnyuD57Pgw1Sandx9rfI+6FtbixbJn3l/k3G9VHnfan43dDefBvJOmYuQuvWaXbU0O/89i9SldTDp6GMe6U6wxYsUzgH+qvRs7k3zsF8XpgHw75Sfnwx5PkVE4/bw25O6s74PfEPSphHR07yyHjgxjCy3kL73f30QAAACw0lEQVSw8/PeZDNbAL+NiLPhjb7rdwDPDE2IC5hBehJbUV+e/3sYaeNUfVZvFzBa0jIR0UiO7+3D+FvZVJIKRw2bkfbsn5N0K2mDvmpEXNPP6dxNOhIZ1zhqkLQO6TzD3f0cd9EtwH8Aj0R6AFIzWwAHRsRlOY5VSOcP3hAR0yQ9CHySdP7lknjzKqm7SclkrYi4ulUgEfEyKUldJuk7wJOkB2dd3tfGWeLEsHBaRlJ14/YMcCXpCqLfSDqM9KyCVUknJa+MiD+T+sQ/IWkL4CngS6SnU91KPX4CfEXSicDppBPin8vD2t4Tjog/SfoDcACpL75hGmlv/HhJPyCd4K6eQO6P1YCTJP2YdHL5q6QTq0TEfZLOBaZIOoS00V2BdF7hoYi4uBfTuRL4O3CupANJCeeHeZwtN67dxd1kHZpF+h3EZ4ELJJ1ASqzrkJLFIRExl7QO7SVpGul8wXeBV5pM41zgM6RzTR9rFEbE3Ly8T8w7JteRTnBvRtqpmZyvshpFWn7PA58gHU3c34e2WoXPMSyctiRtyIuvE/Ne686kDcXppJN3vwTW482+/W+Rru74PekL+QJvPot3yEXEI6T+511JG76DSV1CkH6n0BuHU+7OICKeJvV1TyB1nU0Cju5HyFXnAouSNmCnA2cAPygM3590ZdJ3SYn6UmAr4JHeTCQv291IG+prSU+qexLYrXKOo10Hs+A6NDEiHiftlc8H/gDcRUoW8/IL4FOkDfnNpAsXziR1KVWdQ1r3ngWuqAw7GjiG1GV1Vx6+O/BwHv4M6aqlP5OukNod+HhEPIz1mx/UY8OOpINIV8csHxHz647HbGHjriTreJK+SPp9QxepO+FoYIqTgtngcGKw4eDtpN8urEjq5/4J6YjBzAaBu5LMzKzEJ5/NzKzEicHMzEqcGMzMrMSJwczMSpwYzMys5P8Az0boyZ1IVIMAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.hist(param_grid['num_leaves'], color = 'm', edgecolor = 'k')\n",
    "plt.xlabel('Learning Number of Leaves', size = 14); plt.ylabel('Count', size = 14); plt.title('Number of Leaves Distribution', size = 18);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of leaves is a simple uniform domain.\n",
    "\n",
    "### Sampling from Hyperparameter Domain\n",
    "\n",
    "Let's look at how we sample a set of hyperparameters from our grid using a dictionary comprehension."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'class_weight': 'balanced',\n",
       " 'boosting_type': 'dart',\n",
       " 'num_leaves': 104,\n",
       " 'learning_rate': 0.03168121522242165,\n",
       " 'subsample_for_bin': 220000,\n",
       " 'min_child_samples': 185,\n",
       " 'reg_alpha': 0.9591836734693877,\n",
       " 'reg_lambda': 0.8571428571428571,\n",
       " 'colsample_bytree': 0.6444444444444444,\n",
       " 'encoding': 'one_hot'}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Randomly sample parameters for gbm\n",
    "params = {key: random.sample(value, 1)[0] for key, value in param_grid.items()}\n",
    "params"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To add a `subsample` ratio if the `boosting_type` is not `goss`, we can use an if statement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'class_weight': 'balanced',\n",
       " 'boosting_type': 'dart',\n",
       " 'num_leaves': 104,\n",
       " 'learning_rate': 0.03168121522242165,\n",
       " 'subsample_for_bin': 220000,\n",
       " 'min_child_samples': 185,\n",
       " 'reg_alpha': 0.9591836734693877,\n",
       " 'reg_lambda': 0.8571428571428571,\n",
       " 'colsample_bytree': 0.6444444444444444,\n",
       " 'encoding': 'one_hot',\n",
       " 'subsample': 0.7777777777777778}"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params['subsample'] = random.sample(subsample_dist, 1)[0] if params['boosting_type'] != 'goss' else 1.0\n",
    "params"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We set the subsample to 1.0 if boosting type is goss which is the same as not using any subsampling. (Subsampling is when we train on a subset of the rows (observations) rather than all of them. This technique is also referred to as bagging for \"bootstrap aggregating\")."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cross Validation with Early Stopping in LightGBM\n",
    "\n",
    "The scikit-learn cross validation api does not include the option for early stopping. Therefore, we will use the LightGBM cross validation function with 100 early stopping rounds. To use this function, we need to create a dataset from our features and labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert to numpy array for splitting in cross validation\n",
    "one_hot_features = np.array(one_hot_train)\n",
    "one_hot_features_test = np.array(one_hot_test)\n",
    "\n",
    "le_features = np.array(le_train)\n",
    "le_features_test = np.array(le_test)\n",
    "\n",
    "labels = train_labels[:]\n",
    "labels_test = test_labels[:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a lgb dataset\n",
    "oh_train_set = lgb.Dataset(one_hot_features, label = train_labels)\n",
    "le_train_set = lgb.Dataset(le_features, label = train_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform cross validation with 10 folds\n",
    "r = lgb.cv(params, oh_train_set, num_boost_round = 10000, nfold = 10, metrics = 'auc', \n",
    "           early_stopping_rounds = 100, verbose_eval = False, seed = 50)\n",
    "\n",
    "# Highest score\n",
    "r_best = np.max(r['auc-mean'])\n",
    "\n",
    "# Standard deviation of best score\n",
    "r_best_std = r['auc-stdv'][np.argmax(r['auc-mean'])]\n",
    "\n",
    "print('The maximium ROC AUC in cross validation was {:.5f} with std of {:.5f}.'.format(r_best, r_best_std))\n",
    "print('The ideal number of iterations was {}.'.format(np.argmax(r['auc-mean']) + 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results Dataframe\n",
    "\n",
    "We have our domain and our algorithm which in this case is random selection. The other two parts we need for an optimization problem are an objective function and a data structure to keep track of the results (these four parts will also be required in Bayesian Optimization). \n",
    "\n",
    "Tracking the results will be done via a `dataframe` where each row will hold one evaluation of the objective function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dataframe to hold cv results\n",
    "random_results = pd.DataFrame(columns = ['loss', 'params', 'iteration', 'estimators', 'time', 'ROC AUC'],\n",
    "                       index = list(range(1, MAX_EVALS)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Objective Function\n",
    "\n",
    "The objective function will take in the hyperparameters and return the validation loss (along with some other information to track the search progress). We already choose our metric as ROC AUC and now we need to figure out how to measure it. We can't evaluate the ROC AUC on the test set because that would be cheating. Instead we must use a validation set to tune the model and hope that the results translate to the test set. \n",
    "\n",
    "A better approach than drawing the validation set from the training data (thereby limiting the amount of training data we have) is KFold cross validation. In addition to not limiting the training data, this method should also give us a better estimate of generalization error on the test set because we will be using K validations rather than only one. For this example we will use 10-fold cross validation which means testing and training each set of model hyperparameters 10 times, each time using a different subset of the training data as the validation set. The objective function will return a list of information, including the validation AUC ROC. We also want to make sure to save the hyperparameters used so we know which ones are optimal (or the best out of those we tried).\n",
    "\n",
    "In the case of random search, the next values selected are not based on the past evaluation results, but we clearly should keep track so we know what values worked the best! This will also allow us to contrast random search with informed Bayesian optimization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def random_objective(params, iteration, n_folds = N_FOLDS):\n",
    "    \"\"\"Random search objective function. Takes in hyperparameters\n",
    "       and returns a list of results to be saved.\"\"\"\n",
    "\n",
    "    encoding_type = params['encoding']\n",
    "    \n",
    "    # Handle the encoding\n",
    "    if encoding_type == 'one_hot':\n",
    "        train_set = oh_train_set\n",
    "    elif encoding_type == 'label':\n",
    "        train_set = le_train_set\n",
    "    \n",
    "    del params['encoding']\n",
    "    \n",
    "    start = timer()\n",
    "    \n",
    "    # Perform n_folds cross validation\n",
    "    cv_results = lgb.cv(params, train_set, num_boost_round = 10000, nfold = n_folds, \n",
    "                        early_stopping_rounds = 100, metrics = 'auc', seed = 50)\n",
    "    end = timer()\n",
    "    best_score = np.max(cv_results['auc-mean'])\n",
    "    \n",
    "    # Loss must be minimized\n",
    "    loss = 1 - best_score\n",
    "    \n",
    "    # Boosting rounds that returned the highest cv score\n",
    "    n_estimators = int(np.argmax(cv_results['auc-mean']) + 1)\n",
    "    \n",
    "    params['encoding'] = encoding_type\n",
    "    \n",
    "    # Return list of results\n",
    "    return [loss, params, iteration, n_estimators, end - start, best_score]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random Search Implementation\n",
    "\n",
    "Now we can write a loop to iterate through the number of evals, each time choosing a different set of hyperparameters to evaluate. Each time through the function, the results are saved to the dataframe. \n",
    "\n",
    "The `%%capture` magic captures any outputs from running a cell in a Jupyter Notebook. This is useful because the output from a LightGBM training run cannot be suppressed. However, if we still want to see progress (but not all the progress that LightGBM shows) we can set the option `--no-display` which does not capture IPython display calls. Instead of `print`ing the information each iteration, we just call `display` with the same format. This is because print goes to `stdout` which is being `capture`d. Sorry if this is confusing, but there really does not appear to be any other method for suppressing the LightGBM output!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import display"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-display\n",
    "\n",
    "# Iterate through the specified number of evaluations\n",
    "for i in range(1, MAX_EVALS + 1):\n",
    "\n",
    "    # Randomly sample parameters for gbm\n",
    "    params = {key: random.sample(value, 1)[0] for key, value in param_grid.items()}\n",
    "    \n",
    "    # Handle the boosting type\n",
    "    if params['boosting_type'] == 'goss':\n",
    "        \n",
    "        # Cannot subsample with goss\n",
    "        params['subsample'] = 1.0\n",
    "    else:\n",
    "        \n",
    "        # Subsample supported for gdbt and dart\n",
    "        params['subsample'] = random.sample(subsample_dist, 1)[0]\n",
    "        \n",
    "    # Evaluate the objective function\n",
    "    results_list = random_objective(params, i)\n",
    "    \n",
    "    if i % 50 == 0:\n",
    "        # Display the information\n",
    "        display('Iteration {}: {} Fold CV AUC ROC {:.5f}'.format(i, N_FOLDS, results_list[-1]))\n",
    "    \n",
    "    # Add results to next row in dataframe\n",
    "    random_results.loc[i, :] = results_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort results by best validation score\n",
    "random_results.sort_values('loss', ascending = True, inplace = True)\n",
    "random_results.reset_index(inplace = True, drop = True)\n",
    "random_results.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "random_results.to_csv('results/random_results_kaggle.csv', index = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random Search Performance\n",
    "\n",
    "As a reminder, the baseline gradient boosting model achieved a score of 0.71 on the training set. We can use the best parameters from random search and evaluate them on the testing set. \n",
    "\n",
    "What were the hyperparameters that returned the highest score on the objective function?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "random_results.loc[0, 'params']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `estimators` key holds the average number of estimators trained with early stopping (averaged over 10 folds). We can use this as the optimal number of estimators in the gradient boosting model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find the best parameters and number of estimators\n",
    "best_random_params = random_results.loc[0, 'params'].copy()\n",
    "best_random_estimators = int(random_results.loc[0, 'estimators'])\n",
    "best_random_model = lgb.LGBMClassifier(n_estimators=best_random_estimators, n_jobs = -1, \n",
    "                                       objective = 'binary', **best_random_params, random_state = 50)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit on the training data\n",
    "best_random_model.fit(one_hot_features, labels)\n",
    "\n",
    "# Make test predictions\n",
    "predictions = best_random_model.predict_proba(one_hot_features_test)[:, 1]\n",
    "\n",
    "print('The best model from random search scores {:.4f} on the test data.'.format(roc_auc_score(test_labels, predictions)))\n",
    "print('This was achieved using {} search iterations.'.format(random_results.loc[0, 'iteration']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A modest gain over the baseline. Using more evaluations might increase the score, but at the cost of more optimization time. We also have to remember that the hyperparameters are optimized on the validation data whigh may not translate to the testing data. \n",
    "\n",
    "Now, we can move on to Bayesian methods and see if they are able to achieve better results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bayesian Hyperparameter Optimization using Hyperopt\n",
    "\n",
    "For Bayesian optimization in Hyperopt, we need the following four parts:\n",
    "\n",
    "1. Objective function\n",
    "2. Domain space\n",
    "3. Hyperparameter optimization algorithm\n",
    "4. History of results\n",
    "\n",
    "We already used all of these in random search, but for Hyperopt we will have to make a few changes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Objective Function \n",
    "\n",
    "This objective function will still take in the hyperparameters but it will return not a list but a dictionary. The only requirement for an objective function in Hyperopt is that it has a key in the return dictionary called `\"loss\"` to minimize and a key called `\"status\"` indicating if the evaluation was successful. \n",
    "\n",
    "If we want to keep track of the number of iterations, we can declare a global variables called `ITERATION` that is incremented every time the function is called. In addition to returning comprehensive results, every time the function is evaluated, we will write the results to a new line of a csv file. This can be useful for extremely long evaluations if we want to check on the progress (this might not be the most elegant solution, but it's better than printing to the console because our results will be saved!) \n",
    "\n",
    "The most important part of this function is that now we need to return a __value to minimize__ and not the raw ROC AUC. We are trying to find the best value of the objective function, and even though a higher ROC AUC is better, Hyperopt works to minimize a function. Therefore, a simple solution is to return 1 - ROC (we did this for random search as well for practice)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "from hyperopt import STATUS_OK\n",
    "from timeit import default_timer as timer\n",
    "\n",
    "def objective(params, n_folds = N_FOLDS):\n",
    "    \"\"\"Objective function for Gradient Boosting Machine Hyperparameter Optimization\"\"\"\n",
    "    \n",
    "    # Keep track of evals\n",
    "    global ITERATION\n",
    "    \n",
    "    ITERATION += 1\n",
    "    \n",
    "    encoding_type = params['encoding']\n",
    "    \n",
    "    # Handle the encoding\n",
    "    if encoding_type == 'one_hot':\n",
    "        train_set = oh_train_set\n",
    "    elif encoding_type == 'label':\n",
    "        train_set = le_train_set\n",
    "    \n",
    "    del params['encoding']\n",
    "    \n",
    "    # Retrieve the subsample\n",
    "    subsample = params['boosting_type'].get('subsample', 1.0)\n",
    "    \n",
    "    # Extract the boosting type and subsample to top level keys\n",
    "    params['boosting_type'] = params['boosting_type']['boosting_type']\n",
    "    params['subsample'] = subsample\n",
    "    \n",
    "    # Make sure parameters that need to be integers are integers\n",
    "    for parameter_name in ['num_leaves', 'subsample_for_bin', 'min_child_samples']:\n",
    "        params[parameter_name] = int(params[parameter_name])\n",
    "    \n",
    "    start = timer()\n",
    "    \n",
    "    # Perform n_folds cross validation\n",
    "    cv_results = lgb.cv(params, train_set, num_boost_round = 10000, nfold = n_folds, \n",
    "                        early_stopping_rounds = 100, metrics = 'auc', seed = 50)\n",
    "    \n",
    "    run_time = timer() - start\n",
    "    \n",
    "    # Extract the best score\n",
    "    best_score = np.max(cv_results['auc-mean'])\n",
    "    \n",
    "    # Loss must be minimized\n",
    "    loss = 1 - best_score\n",
    "    \n",
    "    # Boosting rounds that returned the highest cv score\n",
    "    n_estimators = int(np.argmax(cv_results['auc-mean']) + 1)\n",
    "    \n",
    "    params['encoding'] = encoding_type\n",
    "    \n",
    "    if ITERATION % 100 == 0:\n",
    "        # Display the information\n",
    "        display('Iteration {}: {} Fold CV AUC ROC {:.5f}'.format(ITERATION, N_FOLDS, best_score))\n",
    "\n",
    "    # Write to the csv file ('a' means append)\n",
    "    of_connection = open(out_file, 'a')\n",
    "    writer = csv.writer(of_connection)\n",
    "    writer.writerow([loss, params, ITERATION, n_estimators, run_time, best_score])\n",
    "    of_connection.close()\n",
    "    \n",
    "    # Dictionary with information for evaluation\n",
    "    return {'loss': loss, 'params': params, 'iteration': ITERATION,\n",
    "            'estimators': n_estimators, \n",
    "            'train_time': run_time, 'status': STATUS_OK}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although Hyperopt only needs the `loss`, it's a good idea to track other metrics so we can inspect the results. Later we can compare the sequence of searches to that from random search which will help us understand how the method works. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Domain Space\n",
    "\n",
    "Specifying the domain (called the `space` in Hyperopt) is a little trickier than in grid search. In Hyperopt, and other Bayesian optimization frameworks, the domian is not a discrete grid but instead has probability distributions for each hyperparameter. For each hyperparameter, we will use the same limits as with the grid, but instead of being defined at each point, the domain represents probabilities for each hyperparameter. This will probably become clearer in the code and the images! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hyperopt import hp\n",
    "from hyperopt.pyll.stochastic import sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we will go through an example of the learning rate. Again, we are using a log-uniform space for the learning rate defined from 0.005 to 0.2 (same as with the grid.) This time, when we graph the domain, it's more accurate to see a kernel density estimate plot than a histogram (although both show distributions). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the learning rate\n",
    "learning_rate = {'learning_rate': hp.loguniform('learning_rate', np.log(0.005), np.log(0.2))}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can visualize the learning rate by sampling from the space using a Hyperopt utility. Here we plot 10000 samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learning_rate_dist = []\n",
    "\n",
    "# Draw 10000 samples from the learning rate domain\n",
    "for _ in range(10000):\n",
    "    learning_rate_dist.append(sample(learning_rate)['learning_rate'])\n",
    "    \n",
    "plt.figure(figsize = (8, 6))\n",
    "sns.kdeplot(learning_rate_dist, color = 'red', linewidth = 2, shade = True);\n",
    "plt.title('Learning Rate Distribution', size = 18); plt.xlabel('Learning Rate', size = 16); plt.ylabel('Density', size = 16);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of leaves is again a uniform distribution. Here we used `quniform` which means a discrete uniform (as opposed to continuous)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Discrete uniform distribution\n",
    "num_leaves = {'num_leaves': hp.quniform('num_leaves', 30, 150, 1)}\n",
    "num_leaves_dist = []\n",
    "\n",
    "# Sample 10000 times from the number of leaves distribution\n",
    "for _ in range(10000):\n",
    "    num_leaves_dist.append(sample(num_leaves)['num_leaves'])\n",
    "    \n",
    "# kdeplot\n",
    "plt.figure(figsize = (8, 6))\n",
    "sns.kdeplot(num_leaves_dist, linewidth = 2, shade = True);\n",
    "plt.title('Number of Leaves Distribution', size = 18); plt.xlabel('Number of Leaves', size = 16); plt.ylabel('Density', size = 16);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conditional Domain\n",
    "\n",
    "In Hyperopt, we can use nested conditional statements to indicate hyperparameters that depend on other hyperparameters. For example, we know that `goss` boosting type cannot use subsample, so when we set up the `boosting_type` categorical variable, we have to se the `subsample` to 1.0 while for the other boosting types it's a float between 0.5 and 1.0 Let's see this with an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# boosting type domain \n",
    "boosting_type = {'boosting_type': hp.choice('boosting_type', \n",
    "                                            [{'boosting_type': 'gbdt', 'subsample': hp.uniform('subsample', 0.5, 1)}, \n",
    "                                             {'boosting_type': 'dart', 'subsample': hp.uniform('subsample', 0.5, 1)},\n",
    "                                             {'boosting_type': 'goss', 'subsample': 1.0}])}\n",
    "\n",
    "# Draw a sample\n",
    "params = sample(boosting_type)\n",
    "params"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to set both the `boosting_type` and `subsample` as top-level keys in the parameter dictionary. We can use the Python `dict.get` method with a default value of 1.0. This means that if the key is not present in the dictionary, the value returned will be the default (1.0)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Retrieve the subsample if present otherwise set to 1.0\n",
    "subsample = params['boosting_type'].get('subsample', 1.0)\n",
    "\n",
    "# Extract the boosting type\n",
    "params['boosting_type'] = params['boosting_type']['boosting_type']\n",
    "params['subsample'] = subsample\n",
    "\n",
    "params"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is because the gbm cannot use the nested dictionary so we need to set the `boosting_type` and `subsample` as top level keys. Nested conditionals allow us to use a different set of hyperparameters depending on other hyperparameters. For example, we can explore different models with completely different sets of hyperparameters by using nested conditionals. The only requirement is that the first nested statement must be based on a `choice` hyperparameter (the choice could be the type of model).\n",
    "\n",
    "## Complete Bayesian Domain\n",
    "\n",
    "Now we can define the entire domain. Each variable needs to have a label and a few parameters specifying the type and extent of the distribution. For the variables such as boosting type that are categorical, we use the `choice` variable. Other variables types include `quniform`, `loguniform`, and `uniform`. For the complete list, check out the [documentation](https://github.com/hyperopt/hyperopt/wiki/FMin) for Hyperopt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the search space\n",
    "space = {\n",
    "    'encoding': hp.choice('encoding', ['one_hot', 'label']),\n",
    "    'class_weight': hp.choice('class_weight', [None, 'balanced']),\n",
    "    'boosting_type': hp.choice('boosting_type', \n",
    "                                            [{'boosting_type': 'gbdt', 'subsample': hp.uniform('gdbt_subsample', 0.5, 1)}, \n",
    "                                             {'boosting_type': 'dart', 'subsample': hp.uniform('dart_subsample', 0.5, 1)},\n",
    "                                             {'boosting_type': 'goss', 'subsample': 1.0}]),\n",
    "    'num_leaves': hp.quniform('num_leaves', 30, 150, 1),\n",
    "    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),\n",
    "    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),\n",
    "    'min_child_samples': hp.quniform('min_child_samples', 20, 500, 5),\n",
    "    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),\n",
    "    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),\n",
    "    'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example of Sampling from the Domain \n",
    "\n",
    "Let's sample from the domain (using the conditional logic) to see the result of each draw. Every time we run this code, the results will change. (Again notice that we need to assign the top level keys to the keywords understood by the GBM)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample from the full space\n",
    "x = sample(space)\n",
    "\n",
    "# Conditional logic to assign top-level keys\n",
    "subsample = x['boosting_type'].get('subsample', 1.0)\n",
    "x['boosting_type'] = x['boosting_type']['boosting_type']\n",
    "x['subsample'] = subsample\n",
    "\n",
    "x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = sample(space)\n",
    "subsample = x['boosting_type'].get('subsample', 1.0)\n",
    "x['boosting_type'] = x['boosting_type']['boosting_type']\n",
    "x['subsample'] = subsample\n",
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimization Algorithm\n",
    "\n",
    "Although this is the most technical part of Bayesian optimization, defining the algorithm to use in Hyperopt is simple. We will use the Tree Parzen Estimator (read about it [in this paper](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf)) which is one method for constructing the surrogate function and choosing the next hyperparameters to evaluate. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hyperopt import tpe\n",
    "\n",
    "tpe_algorithm = tpe.suggest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Results History\n",
    "\n",
    "The final part is the result history. Here, we are using two methods to make sure we capture all the results:\n",
    "\n",
    "1. A `Trials` object that stores the dictionary returned from the objective function\n",
    "2. Writing to a csv file every iteration\n",
    "\n",
    "The csv file option also lets us monitor the results of an on-going experiment. (Although do not use Excel to open the file while training is on-going. Instead check the results using `tail results/gbm_trials.csv` from bash or another command line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hyperopt import Trials\n",
    "\n",
    "trials = Trials()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Trials` object will hold everything returned from the objective function in the `.results` attribute. It also holds other information from the search, but we return everything we need from the objective. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# File to save first results\n",
    "out_file = 'results/gbm_results_kaggle.csv'\n",
    "of_connection = open(out_file, 'w')\n",
    "writer = csv.writer(of_connection)\n",
    "\n",
    "# Write the headers to the file\n",
    "writer.writerow(['loss', 'params', 'iteration', 'estimators', 'time', 'ROC AUC'])\n",
    "of_connection.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Every time the objective function is called, it will write one line to this file. Running the cell above does clear the file though."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bayesian Optimization\n",
    "\n",
    "We have everything in place needed to run the optimization. First we declare the global variable that will be used to keep track of the number of iterations. Then, we call `fmin` passing in everything we defined above and the maximum number of iterations to run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hyperopt import fmin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "\n",
    "# Global variable\n",
    "global  ITERATION\n",
    "\n",
    "ITERATION = 0\n",
    "\n",
    "# Run optimization\n",
    "best = fmin(fn = objective, space = space, algo = tpe.suggest, \n",
    "            max_evals = MAX_EVALS, trials = trials, verbose = 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `.results` attribute of the `Trials` object has all information from the objective function. If we sort this by the lowest loss, we can see the hyperparameters that performed the best in terms of validation loss. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort the trials with lowest loss (highest AUC) first\n",
    "trials_results = sorted(trials.results, key = lambda x: x['loss'])\n",
    "trials_results[:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also access the results from the csv file (which might be easier since it's already a dataframe)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = pd.read_csv('results/gbm_results_kaggle.csv')\n",
    "\n",
    "# Sort with best scores on top and reset index for slicing\n",
    "results.sort_values('loss', ascending = True, inplace = True)\n",
    "results.reset_index(inplace = True, drop = True)\n",
    "results.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For some reason, when we save to a file and then read back in, the dictionary of hyperparameters is represented as a string. To convert from a string back to a dictionary we can use the `ast` library and the `literal_eval` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import ast\n",
    "\n",
    "# Convert from a string to a dictionary\n",
    "ast.literal_eval(results.loc[0, 'params'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate Best Results\n",
    "\n",
    "Now for the moment of truth: did the optimization pay off? For this problem with a relatively small dataset, the benefits of hyperparameter optimization compared to random search are probably minor (if there are any). Random search might turn up a better result in fewer iterations simply becuase of randomness! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract the ideal number of estimators and hyperparameters\n",
    "best_bayes_estimators = int(results.loc[0, 'estimators'])\n",
    "best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()\n",
    "\n",
    "# Re-create the best model and train on the training data\n",
    "best_bayes_model = lgb.LGBMClassifier(n_estimators=best_bayes_estimators, \n",
    "                                      n_jobs = -1, \n",
    "                                      objective = 'binary', \n",
    "                                      random_state = 50, \n",
    "                                      **best_bayes_params)\n",
    "best_bayes_model.fit(one_hot_features, labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluate on the testing data \n",
    "preds = best_bayes_model.predict_proba(one_hot_features_test)[:, 1]\n",
    "print('The best model from Bayes optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(test_labels, preds)))\n",
    "print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to save the results, we can use the json file format. Saving the trials results will allow us to continue the hyperparameter search where we left off."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# Save the trial results\n",
    "with open('trials_kaggle.json', 'w') as f:\n",
    "    f.write(json.dumps(trials_results))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Comparison to Random Search\n",
    "\n",
    "Comparing the results to random seach both in numbers and figures can help us understand how Bayesian Optimization searches work. First, we can look at the best hyperparameters (as determined from the validation error) from both searches.\n",
    "\n",
    "### Optimal Hyperparameters\n",
    "\n",
    "We can compare the \"best\" hyperparameters found from both search methods. It's interesting to compare the results because they suggest there may be multiple configurations that yield roughly the same validation error. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_random_params['method'] = 'random search'\n",
    "best_random_params['iteration'] = random_results.loc[0, 'iteration']\n",
    "\n",
    "best_bayes_params['method'] = 'Bayesian optimization'\n",
    "best_bayes_params['iteration'] = random_results.loc[0, 'iteration']\n",
    "\n",
    "best_params = pd.DataFrame(best_bayes_params, index = [0]).append(pd.DataFrame(best_random_params, index = [0]), \n",
    "                                                                  ignore_index = True, sort = True)\n",
    "best_params"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing Hyperparameters\n",
    "\n",
    "One interesting thing we can do with the results is to see the different hyperparameters tried by both random search and the Tree Parzen Estimator. Since random search is choosing without regards to the previous results, we would expect that the distribution of samples should be close to the domain space we defined (it won't be exact since we are using a fairly small number of iterations). On the other hand, the Bayes Optimization, if given enough time, should concetrate on the \"more promising\" hyperparameters. \n",
    "\n",
    "In addition to a more concentrated search, we expect that the average validation loss of the Bayesian Optimization should be lower than that on the random search because it chooses values likely (according to the probability model) to yield lower losses on the objective function. The validation loss should also decrease over time with the Bayesian method. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we will need to extract the hyperparameters from both search methods. We will store these in separate dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a new dataframe for storing parameters\n",
    "random_params = pd.DataFrame(columns = list(random_results.loc[0, 'params'].keys()),\n",
    "                            index = list(range(len(random_results))))\n",
    "\n",
    "# Add the results with each parameter a different column\n",
    "for i, params in enumerate(random_results['params']):\n",
    "    random_params.loc[i, :] = list(params.values())\n",
    "    \n",
    "random_params['loss'] = random_results['loss']\n",
    "random_params['iteration'] = random_results['iteration']\n",
    "random_params.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a new dataframe for storing parameters\n",
    "bayes_params = pd.DataFrame(columns = list(ast.literal_eval(results.loc[0, 'params']).keys()),\n",
    "                            index = list(range(len(results))))\n",
    "\n",
    "# Add the results with each parameter a different column\n",
    "for i, params in enumerate(results['params']):\n",
    "    bayes_params.loc[i, :] = list(ast.literal_eval(params).values())\n",
    "    \n",
    "bayes_params['loss'] = results['loss']\n",
    "bayes_params['iteration'] = results['iteration']\n",
    "\n",
    "bayes_params.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Learning Rates\n",
    "\n",
    "The first plot shows the sampling distribution, random search, and Bayesian optimization learning rate distributions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (20, 8))\n",
    "plt.rcParams['font.size'] = 18\n",
    "\n",
    "# Density plots of the learning rate distributions \n",
    "sns.kdeplot(learning_rate_dist, label = 'Sampling Distribution', linewidth = 4)\n",
    "sns.kdeplot(random_params['learning_rate'], label = 'Random Search', linewidth = 4)\n",
    "sns.kdeplot(bayes_params['learning_rate'], label = 'Bayes Optimization', linewidth = 4)\n",
    "plt.vlines([best_random_params['learning_rate'], best_bayes_params['learning_rate']],\n",
    "           ymin = 0.0, ymax = 50.0, linestyles = '--', linewidth = 4, colors = ['orange', 'green'])\n",
    "plt.legend()\n",
    "plt.xlabel('Learning Rate'); plt.ylabel('Density'); plt.title('Learning Rate Distribution');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Boosting Type \n",
    "\n",
    "Random search should use the boosting types with the same frequency. However, Bayesian Optimization might have decided (modeled) that one boosting type is better than another for this problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots(1, 2, sharey = True, sharex = True)\n",
    "\n",
    "# Bar plots of boosting type\n",
    "random_params['boosting_type'].value_counts().plot.bar(ax = axs[0], figsize = (14, 6), color = 'orange', title = 'Random Search Boosting Type')\n",
    "bayes_params['boosting_type'].value_counts().plot.bar(ax = axs[1], figsize = (14, 6), color = 'green', title = 'Bayes Optimization Boosting Type');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Random Search boosting type percentages')\n",
    "100 * random_params['boosting_type'].value_counts() / len(random_params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Bayes Optimization boosting type percentages')\n",
    "100 * bayes_params['boosting_type'].value_counts() / len(bayes_params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plots of All Numeric Hyperparameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Iterate through each hyperparameter\n",
    "for i, hyper in enumerate(random_params.columns):\n",
    "    if hyper not in ['class_weight', 'boosting_type', 'iteration', 'subsample', 'metric', 'verbose', 'loss', 'learning_rate', 'encoding']:\n",
    "        plt.figure(figsize = (14, 6))\n",
    "        # Plot the random search distribution and the bayes search distribution\n",
    "        if hyper != 'loss':\n",
    "            sns.kdeplot([sample(space[hyper]) for _ in range(1000)], label = 'Sampling Distribution', linewidth = 4)\n",
    "        sns.kdeplot(random_params[hyper], label = 'Random Search', linewidth = 4)\n",
    "        sns.kdeplot(bayes_params[hyper], label = 'Bayes Optimization', linewidth = 4)\n",
    "        plt.vlines([best_random_params[hyper], best_bayes_params[hyper]],\n",
    "                     ymin = 0.0, ymax = 10.0, linestyles = '--', linewidth = 4, colors = ['orange', 'green'])\n",
    "        plt.legend(loc = 1)\n",
    "        plt.title('{} Distribution'.format(hyper))\n",
    "        plt.xlabel('{}'.format(hyper)); plt.ylabel('Density');\n",
    "        plt.show();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evolution of Hyperparameters\n",
    "\n",
    "We can also plot the hyperparameters over time (against the number of iterations) to see how they change for the Bayes Optimization. First we will map the `boosting_type` to an integer for plotting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Map boosting type to integer (essentially label encoding)\n",
    "bayes_params['boosting_int'] = bayes_params['boosting_type'].replace({'gbdt': 1, 'goss': 2, 'dart': 3})\n",
    "\n",
    "# Plot the boosting type over the search\n",
    "plt.plot(bayes_params['iteration'], bayes_params['boosting_int'], 'ro')\n",
    "plt.yticks([1, 2, 3], ['gdbt', 'goss', 'dart']);\n",
    "plt.xlabel('Iteration'); plt.title('Boosting Type over Search');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots(1, 4, figsize = (24, 6))\n",
    "i = 0\n",
    "\n",
    "# Plot of four hyperparameters\n",
    "for i, hyper in enumerate(['colsample_bytree', 'learning_rate', 'min_child_samples', 'num_leaves']):\n",
    "    \n",
    "        # Scatterplot\n",
    "        sns.regplot('iteration', hyper, data = bayes_params, ax = axs[i])\n",
    "        axs[i].scatter(best_bayes_params['iteration'], best_bayes_params[hyper], marker = '*', s = 200, c = 'k')\n",
    "        axs[i].set(xlabel = 'Iteration', ylabel = '{}'.format(hyper), title = '{} over Search'.format(hyper));\n",
    "\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_bayes_params"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots(1, 3, figsize = (18, 6))\n",
    "i = 0\n",
    "\n",
    "# Scatterplot of next three hyperparameters\n",
    "for i, hyper in enumerate(['reg_alpha', 'reg_lambda', 'subsample_for_bin']):\n",
    "        sns.regplot('iteration', hyper, data = bayes_params, ax = axs[i])\n",
    "        axs[i].scatter(best_bayes_params['iteration'], best_bayes_params[hyper], marker = '*', s = 200, c = 'k')\n",
    "        axs[i].set(xlabel = 'Iteration', ylabel = '{}'.format(hyper), title = '{} over Search'.format(hyper));\n",
    "\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Validation Losses\n",
    "\n",
    "Finally, we can look at the losses recorded by both random search and Bayes Optimization. We would expect the average loss recorded by Bayes Optimization to be lower because this method should spend more time in promising regions of the search space."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dataframe of just scores\n",
    "scores = pd.DataFrame({'ROC AUC': 1 - random_params['loss'], 'iteration': random_params['iteration'], 'search': 'random'})\n",
    "scores = scores.append(pd.DataFrame({'ROC AUC': 1 - bayes_params['loss'], 'iteration': bayes_params['iteration'], 'search': 'Bayes'}))\n",
    "\n",
    "scores['ROC AUC'] = scores['ROC AUC'].astype(np.float32)\n",
    "scores['iteration'] = scores['iteration'].astype(np.int32)\n",
    "\n",
    "scores.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can make histograms of the scores (not taking in account the iteration) on the same x-axis scale to see if there is a difference in scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (18, 6))\n",
    "\n",
    "# Random search scores\n",
    "plt.subplot(1, 2, 1)\n",
    "plt.hist(1 - random_results['loss'].astype(np.float64), label = 'Random Search', edgecolor = 'k');\n",
    "plt.xlabel(\"Validation ROC AUC\"); plt.ylabel(\"Count\"); plt.title(\"Random Search Validation Scores\")\n",
    "plt.xlim(0.75, 0.78)\n",
    "\n",
    "# Bayes optimization scores\n",
    "plt.subplot(1, 2, 2)\n",
    "plt.hist(1 - bayes_params['loss'], label = 'Bayes Optimization', edgecolor = 'k');\n",
    "plt.xlabel(\"Validation ROC AUC\"); plt.ylabel(\"Count\"); plt.title(\"Bayes Optimization Validation Scores\");\n",
    "plt.xlim(0.75, 0.78);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It does appear that the validation ROC AUC for the Bayesian optimization is higher than that for Random Search. However, as we have seen, this does not necessarily translate to a better testing score! \n",
    "\n",
    "Bayesian optimization should get better over time. Let's plot the scores against the iteration to see if there was improvement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot of scores over the course of searching\n",
    "sns.lmplot('iteration', 'ROC AUC', hue = 'search', data = scores, size = 8);\n",
    "plt.scatter(best_bayes_params['iteration'], 1 - best_bayes_params['loss'], marker = '*', s = 400, c = 'orange', edgecolor = 'k')\n",
    "plt.scatter(best_random_params['iteration'], 1 - best_random_params['loss'], marker = '*', s = 400, c = 'blue', edgecolor = 'k')\n",
    "plt.xlabel('Iteration'); plt.ylabel('ROC AUC'); plt.title(\"Validation ROC AUC versus Iteration\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's reassuring to see that the validation ROC AUC scores of Bayesian optimization increase over time. What this shows is that the model is exploring hyperparameters that are better according to the cross validation metric! It would be interesting to continue searching and see if there is a plateau in the validation scores (there would have to be eventually). Moreover, even if validation scores continue to increase, that does not mean a better model for the testing data! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save dataframes of parameters\n",
    "bayes_params.to_csv('results/bayes_params_kaggle.csv', index = False)\n",
    "random_params.to_csv('results/random_params_kaggle.csv', index = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Continue Searching\n",
    "\n",
    "We can keep running the Bayesian hyperparameter search for more iterations to try for better results. Hyperopt will continue searching where it left off if we [pass it a trials object that already has information on previous runs](https://github.com/hyperopt/hyperopt/issues/267). This raises a good point: always save your previous results, because you never know when they will be useful! \n",
    "\n",
    "Another interesting point to not is that Bayesian Optimization methods do not have any internal state which means all they need are the results: previous inputs to the objective function and the resulting loss. Based only on these results, these methods can construct a surrogate function and suggest the next set of hyperparameters to evaluate. The internals of the objective function have no effect on the Bayesian Optimization method hence the naming of this as a black box optimization method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Continue training\n",
    "ITERATION = MAX_EVALS + 1\n",
    "\n",
    "# Set more evaluations\n",
    "MAX_EVALS = 1000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "\n",
    "# Use the same trials object to keep training\n",
    "best = fmin(fn = objective, space = space, algo = tpe.suggest, \n",
    "            max_evals = MAX_EVALS, trials = bayes_trials, verbose = 1, \n",
    "            rstate = np.random.RandomState(50))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort the trials with lowest loss (highest AUC) first\n",
    "bayes_trials_results = sorted(bayes_trials.results, key = lambda x: x['loss'])\n",
    "bayes_trials_results[:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = pd.read_csv('results/gbm_trials_kaggle.csv')\n",
    "\n",
    "# Sort values with best on top and reset index for slicing\n",
    "results.sort_values('loss', ascending = True, inplace = True)\n",
    "results.reset_index(inplace = True, drop = True)\n",
    "results.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract the ideal number of estimators and hyperparameters\n",
    "best_bayes_estimators = int(results.loc[0, 'estimators'])\n",
    "best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()\n",
    "\n",
    "# Re-create the best model and train on the training data\n",
    "best_bayes_model = lgb.LGBMClassifier(n_estimators=best_bayes_estimators, n_jobs = -1, \n",
    "                                       objective = 'binary', random_state = 50, **best_bayes_params)\n",
    "best_bayes_model.fit(features, labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluate on the testing data \n",
    "preds = best_bayes_model.predict_proba(test_features)[:, 1]\n",
    "print('The best model from Bayes optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(test_labels, preds)))\n",
    "print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The continuation of the search did slightly improve the validation score (again depending on training run). Instead of training more, we might want to restart the search so the algorithm can spend more time exploring the domain space. As searching continues, the algorithm shifts from exploring (trying new values) to exploiting (trying those values that worked best in the past). This is generally what we want unless the model gets stuck in a local minimum at which point we would want to restart the search in a different region of the hyperparameter space. Bayesian Optimization of hyperparameters is still prone to overfitting, even when using cross-validation because it can get settle into a local minimum of the objective function. It is very difficult to tell when this occurs for a high-dimensional problem!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusions\n",
    "\n",
    "In this notebook, we saw how to implement automated hyperparameter tuning with Bayesian Optimization methods. We used the open-source Python library Hyperopt with the Tree Parzen Estimator to optimize the hyperparameters of a gradient boosting machine. \n",
    "\n",
    "Bayesian model-based optimization can be more efficient than random search, finding a better set of model hyperparameters in fewer search iterations (although not in every case). However, just because the model hyperparameters are better on the validation set does not mean they are better for the testing set! For this training run, Bayesian Optimization found a better set of hyperparamters according to the validation and the test data although the testing score was much lower than the validation ROC AUC. This is a useful lesson that even when using cross-validation, overfitting is still one of the top problems in machine learning. \n",
    "\n",
    "Bayesian optimization  is a powerful technique that we can use to tune any machine learning model, so long as we can define an objective function that returns a value to minimize and a domain space over which to search. This can extend to any function that we want to minimize (not just hyperparameter tuning). Bayesian optimization can be a significant upgrade over uninformed methods such as random search and because of the ease of use in Python are now a good option to use for hyperparameter tuning. As with most subjects in machine learning, there is no single best answer for hyperparameter tuning, but Bayesian optimization methods should be a tool that helps data scientists with the tedious but necessary task of model tuning! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
